on 12-16-2022 10:38 AM - edited on 06-30-2023 12:26 AM by greatdevaks
This article is co-authored by @greatdevaks and @sampriyadarshi.
Many enterprises are adopting Apigee Hybrid for use cases such as operating in regulated industries and managing the runtime on their own, whether on-premises or in another cloud where their APIs reside. Cassandra is one of the most important components of Apigee Hybrid. This article describes some of the most common Cassandra operations that you will likely have to perform if you are running Apigee Hybrid.
Apache Cassandra is the runtime datastore that provides data persistence for the Apigee Hybrid Runtime Plane.
You deploy the Cassandra database as a StatefulSet on your Kubernetes cluster. Persistent Volumes (PV) are used to store the data.
The Cassandra database stores runtime entities such as Key Management System (KMS) data, Key Value Maps (KVMs), OAuth tokens, quota counters, and cache data.
Diagram 1: Apigee Hybrid Runtime Plane Components
Oftentimes people use only a single node pool when running Apigee Hybrid. This can quickly become a problem: Cassandra pods might demand more resources and fail to get scheduled, which can cause your Apigee Hybrid setup to behave abnormally. That is why it is very important to assign a dedicated node pool to Cassandra workloads.
Cassandra is the only stateful component in the Apigee Hybrid Runtime; all the other components are stateless. It is therefore a good idea to dedicate separate node pools to the stateless components and to the stateful component (in this case, Cassandra). Also, Cassandra is a resource-intensive service and should not be deployed on a node with any other Apigee Hybrid service.
You can configure Apigee Hybrid to assign a dedicated node pool for Cassandra by specifying the nodeSelector property as shown below:
cassandra:
  ...
  nodeSelector:
    key: "cloud.google.com/gke-nodepool" # Key for GKE specifically; the key name changes based on the node labels of the system hosting Kubernetes.
    value: "apigee-data"
  ...
nodeSelector:
  requiredForScheduling: true
  apigeeRuntime:
    key: "cloud.google.com/gke-nodepool"
    value: "apigee-runtime"
  apigeeData:
    key: "cloud.google.com/gke-nodepool"
    value: "apigee-data"
If cassandra.nodeSelector.key and cassandra.nodeSelector.value are set, the values they specify override the ones specified in nodeSelector.apigeeData.
You need to make sure that nothing else runs on the node pool dedicated to Cassandra workloads. Otherwise, other pods are likely to get scheduled on that node pool, and when the need arises to schedule Cassandra pods, there might not be enough resources left on the node pool to schedule them.
If for some reason you set the nodeSelector.requiredForScheduling property, mentioned in the previous point, to false, taints and tolerations can be utilized to ensure that the apigee-data node pool always runs only Cassandra workloads and nothing else.
Before configuring the tolerations in your overrides file, you will have to taint the nodes which are supposed to run Cassandra:
kubectl taint nodes <node_name> workload=cassandra-datastore:NoSchedule
Once you have tainted the nodes, you can specify the cassandra.tolerations property in your overrides file like below:
cassandra:
  ...
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "cassandra-datastore"
    effect: "NoSchedule"
  ...
As you deploy more and more API Proxies on Apigee and they start to receive significant traffic, it becomes very important to scale up your Apigee Hybrid components so that they can serve the larger traffic volume. All Apigee Hybrid components support autoscaling except Cassandra. That is why it is very important to scale Cassandra appropriately, as per demand.
Apigee Hybrid is configured to run one Cassandra pod per Kubernetes worker node; a podAntiAffinity rule enforces this behavior. Keep this behavior in mind when scaling Cassandra, as it helps you reserve sufficient capacity for provisioning the new set of Cassandra replicas.
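Conceptually, the one-pod-per-node behavior corresponds to a Kubernetes podAntiAffinity rule like the one sketched below. This is only an illustration; the label selector and the exact rule in the Apigee-generated StatefulSet may differ:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: apigee-cassandra  # assumed label; Apigee manages the real one
      topologyKey: "kubernetes.io/hostname"  # at most one matching pod per node
```

The hard requirement (`requiredDuringScheduling...`) is what causes pods to stay Pending, rather than co-locate, when no free node is available.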
In Apigee Hybrid, Cassandra is not configured for horizontal or vertical pod autoscaling, so the scaling has to be done manually.
The apigee-data node pool must have additional capacity when scaling Cassandra up horizontally. If there is no additional capacity when performing the horizontal scaling, the Cassandra pods will go into the Pending state.
The cassandra.replicaCount configuration property should be tweaked to scale the Cassandra pods horizontally, as shown below. The replication factor for Cassandra is three, and scaling (up or down) should be done in multiples of three.
cassandra:
  ...
  replicaCount: "9"
  ...
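The multiples-of-three rule can be expressed as a small validation check. `valid_cassandra_replica_count` is a hypothetical helper for illustration, not part of the Apigee tooling:

```python
def valid_cassandra_replica_count(count: int) -> bool:
    """Cassandra's replication factor in Apigee Hybrid is three, so
    replicaCount must be a positive multiple of three."""
    return count > 0 and count % 3 == 0

print(valid_cassandra_replica_count(9))   # → True
print(valid_cassandra_replica_count(10))  # → False
```

Scaling from 3 to 9 replicas is valid; scaling from 3 to 10 is not.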
Depending on the traffic load, you might want to scale down the number of Cassandra nodes (all the Cassandra nodes of a cluster are connected together in the form of a Cassandra ring). Take the utmost caution when scaling down Cassandra, because it is more invasive than scaling up. If any node other than the ones to be decommissioned is unhealthy, do not proceed with downscaling; Kubernetes will not be able to scale down the pods in the cluster.
Before proceeding with the downscale operation, determine whether the Cassandra cluster has enough storage to support it (after the scale-down operation, the remaining active Cassandra nodes will see a bump in their respective storage usage). After scaling down, the active Cassandra nodes should have no more than 75% of their storage utilized.
For example, as shown in Table 1 below, if your cluster has six Cassandra nodes and they are all approximately 50% utilized on storage, downscaling to three nodes would bring all three active nodes to 100% storage utilization, which would not leave any room for continued operations.
If, however, you have nine Cassandra nodes, all approximately 50% utilized on storage, downscaling to six nodes would leave each remaining node at 75% storage utilization. Because of this 25% headroom, the Cassandra cluster can be scaled down from nine to six nodes.
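The headroom calculation above can be sketched as follows. `post_downscale_utilization` is a hypothetical helper that assumes the total data is redistributed evenly across the remaining nodes:

```python
def post_downscale_utilization(current_nodes: int, new_nodes: int,
                               current_utilization: float) -> float:
    """Estimate per-node storage utilization after a scale-down, assuming
    the data is spread evenly across the remaining nodes."""
    # Total data, expressed in units of a single node's capacity.
    total_data = current_nodes * current_utilization
    return total_data / new_nodes

print(post_downscale_utilization(6, 3, 0.50))  # → 1.0 (100%: no headroom left)
print(post_downscale_utilization(9, 6, 0.50))  # → 0.75 (75%: at the guideline limit)
```

Anything above 0.75 for the planned target size means the downscale should not proceed.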
Once the scale down operation is complete, the Persistent Volume Claims and Persistent Volumes associated with Cassandra must be cleaned up manually.
Table 1: Downscaling impact on overall Cassandra Cluster’s storage utilization
| Previous Number of Cassandra Nodes | Updated Number of Cassandra Nodes | Previous Storage Utilization | Updated Storage Utilization |
| --- | --- | --- | --- |
| 6 | 3 | 50% | 100% |
| 9 | 6 | 50% | 75% |
As stated earlier, only one Cassandra pod can be deployed per Kubernetes worker node because of podAntiAffinity. If the Cassandra pods have to be scaled vertically, this is achieved through replacement. Vertical scaling of Cassandra pods requires additional nodes. The best practice is to create a completely separate node pool based on the new Cassandra CPU and memory specifications and then deploy the instances there. The configuration for this process looks like the following:
nodeSelector:
  requiredForScheduling: true
  apigeeData:
    key: "cloud.google.com/gke-nodepool"
    value: "apigee-data-new"
cassandra:
  resources:
    requests:
      cpu: 14
      memory: 16Gi
The Apigee Hybrid Cassandra component uses Persistent Volumes to store data. The size of the Persistent Volume is defined during installation and initial configuration, and this initial storage size is immutable. Therefore, any new node added to the cluster will use the same Persistent Volume size.
It is possible to increase the size of an existing Persistent Volume by making the change directly on the Persistent Volume Claim, but new nodes that get provisioned will still use the smaller initial Persistent Volume size. To overcome this, follow a procedure that expands the storage capacity of the existing volumes and also allows new nodes to be provisioned with the expanded Persistent Volume size.
You can find the detailed steps for expanding Cassandra Persistent Volumes here.
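As part of that procedure, the new size is also reflected in the overrides file so that newly provisioned nodes get the larger volume. A sketch, assuming a new size of 100Gi (check the property against the overrides reference for your Apigee Hybrid version):

```yaml
cassandra:
  storage:
    capacity: 100Gi  # new Persistent Volume size for newly provisioned nodes (example value)
```

Remember that this setting alone does not resize existing volumes; those must be expanded through their Persistent Volume Claims.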
It’s recommended to run Apigee Hybrid in multiple regions/datacenters, at least two, to ensure high availability. Provided you have taken care of other configurations such as DNS, this keeps your API Proxies available without any downtime, even if one of the regions goes down.
While setting up Apigee Hybrid in multiple regions, Cassandra should be configured very carefully, because the data has to be replicated between all the Apigee Runtime Instances in the different regions/datacenters. If replication is not configured properly between the Cassandra clusters, you might not be able to deploy or access your API Proxies.
In order for the Cassandra clusters in different Kubernetes clusters to connect with each other and start replicating data, you should satisfy requirements such as network connectivity between the Cassandra pods across the clusters and a seed host configuration pointing at the existing region.
You can deploy your Apigee Organization in up to 10 regions/datacenters and have a maximum of 150 Cassandra pods across the same.
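When expanding to a new region, the new region's overrides file points at a seed node in an existing region so that the rings can join. A sketch with example values (the IP address, datacenter, and rack names are assumptions; use the values from your own setup):

```yaml
cassandra:
  multiRegionSeedHost: "10.0.0.11"  # IP of a Cassandra pod in the existing region (example value)
  datacenter: "dc-2"                # name assigned to the new datacenter (example value)
  rack: "ra-1"                      # rack name (example value)
```

Once the new region has joined the ring and data has replicated, the seed host reference is typically removed from the overrides.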
The architecture below shows how two Apigee Hybrid Runtime Instances are set up in two regions/datacenters, with only the Cassandra pods communicating with each other over ports 7000/7001. All other components work independently.
Diagram 2: Apigee Hybrid Multi-Region/Multi-Datacenter Deployment
You can check the status of the Cassandra clusters configured in two regions by running the below-mentioned command:
kubectl exec apigee-cassandra-default-0 -n apigee -- nodetool -u jmxuser -pw <password> status
Output of the nodetool command is shown below:
Diagram 3: Apigee Cassandra nodetool Output for Multi-Region/Multi-Datacenter
Suppose you are running Cassandra in a single region, it goes down because of some issue, and you want to restore it with the existing data intact. Since Cassandra stores data about your Apigee Hybrid Runtime, it is very important to take backups in case you need to restore Cassandra data in the future for various reasons (such as failure of the only available region).
Backup for Cassandra is not enabled by default. You can specify the backup configuration and its schedule in the overrides file.
There are two backup storage backends which you can choose from: Google Cloud Storage and a remote server. The below configuration shows Google Cloud Storage as the backend for Cassandra backups:
cassandra:
  backup:
    enabled: true
    serviceAccountPath: "./cass-backup-sa.json"
    dbStorageBucket: "gs://cass-backups"
    schedule: "45 23 * * 6"  # cron format: 23:45 every Saturday
Your Cassandra backups will appear like below in Cloud Storage:
Diagram 4: Apigee Cassandra Backups in Google Cloud Storage
Refer to this link to configure a remote server as the Cassandra backup storage backend.
You can use backups to restore the Apigee Hybrid Runtime in case of any disaster.
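A restore is typically triggered through the overrides file as well. A sketch, assuming a Google Cloud Storage backend; the timestamp is an example value identifying which backup snapshot to restore, and the property names should be checked against the backup/restore documentation for your Apigee Hybrid version:

```yaml
cassandra:
  restore:
    enabled: true
    snapshotTimestamp: "20221122001000"  # timestamp of the backup to restore (example value)
    serviceAccountPath: "./cass-backup-sa.json"
```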
Don’t use backups for the below reasons:
It’s important to note that you will incur some downtime during the restoration process, and data written between the latest successful backup and the completion of the restoration will be lost.
When restoring, keep the below-mentioned points in mind:
To summarize, we looked at how to schedule Cassandra pods on dedicated node pools and why it’s important to do capacity planning beforehand to reliably run your Cassandra workloads. We also looked at how to handle Cassandra scaling manually, since Cassandra, unlike the other Apigee Hybrid components, does not support horizontal or vertical pod autoscaling. Lastly, we looked at how to configure Cassandra for running Apigee Hybrid in multiple regions/datacenters to achieve high availability.
To go forward from here, it is highly recommended to go through the Apigee Hybrid Cassandra documentation to get more details on some of the other configurations that you might find fit for your use case in order to better run Apigee Hybrid in your environment.
Stay tuned for more advanced parts covering additional Cassandra operations.
Thanks @ncardace and @yuriyl for reviewing the draft of this blog.
@sampriyadarshi any plan to release part 2
@aramkrishna6 Yes, it's in draft. Should be released soon.
@sampriyadarshi Appreciate your action on request, if you can add more details on Cassandra ring in multi cloud and with on prem and Cassandra back up and recovery with CSI (with public cloud ) and without CSI (on premise for example) . In addition to what's documented in public and more details for Cassandra password reset
@sampriyadarshi Please let us know the link for updated article (Part-2)
@aramkrishna6 here is the link of Part 2: https://www.googlecloudcommunity.com/gc/Cloud-Product-Articles/Advanced-Apigee-Hybrid-Cassandra-Oper...
As always, your feedback is appreciated.