Advanced Apigee Hybrid Cassandra Operations - Part 1

Overview

This article is co-authored by @greatdevaks and @sampriyadarshi.

Many enterprises are adopting Apigee Hybrid for use cases such as operating in regulated industries and managing the runtime themselves, be it on-premises or in any other cloud where their APIs reside. Cassandra is one of the most important components of Apigee Hybrid. This article describes some of the most common Cassandra operations that you will likely have to perform when running Apigee Hybrid.

Where does Cassandra fit in Apigee Hybrid?

Apache Cassandra is the runtime datastore that provides data persistence for the Apigee Hybrid Runtime Plane.

You deploy the Cassandra database as a StatefulSet on your Kubernetes cluster. Persistent Volumes (PV) are used to store the data.

The Cassandra database stores information about the following entities:

  • Key Management System (KMS)
  • Key Value Map (KVM)
  • Response Cache
  • OAuth
  • Quotas
  • And many more…refer here for details.

 


Diagram 1: Apigee Hybrid Runtime Plane Components

Common Operations

Node Pool and Node Selector

Oftentimes people use only a single node pool while running Apigee Hybrid. This can quickly become a problem: Cassandra pods might demand more resources and fail to get scheduled, which can lead to your Apigee Hybrid setup behaving abnormally. That is why it is very important to assign a dedicated node pool to Cassandra workloads.

Cassandra is the only stateful component in the Apigee Hybrid Runtime; all the other components are stateless. It is therefore a good idea to dedicate separate node pools to the stateless components and to the stateful component (in this case Cassandra). Also, Cassandra is a resource-intensive service and should not be deployed on a node alongside any other Apigee Hybrid service.
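On GKE, for instance, a dedicated Cassandra node pool could be created along the lines of the sketch below; the cluster name, machine type, and node count are illustrative placeholders, not prescriptions:

```shell
# Create a dedicated node pool named "apigee-data" for Cassandra workloads.
# Cluster name, machine type, and node count are illustrative placeholders;
# size the nodes according to your Cassandra CPU/memory requirements.
gcloud container node-pools create apigee-data \
  --cluster=my-hybrid-cluster \
  --machine-type=n1-standard-8 \
  --num-nodes=3
```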

You can configure Apigee Hybrid to assign a dedicated node pool for Cassandra by specifying the nodeSelector property as shown below:

cassandra:
  nodeSelector:
    ...
    key: "cloud.google.com/gke-nodepool" # Key for GKE specifically; key name would change based on the system hosting Kubernetes, as per the node labels provided.
    value: "apigee-data"
    ...
 
The nodeSelector config section has a property called requiredForScheduling. For production environments, nodeSelector.requiredForScheduling should be set to true. If set to false (default), underlying pods can be scheduled on any unoccupied node belonging to any node pool (not specifically the node pool dedicated for Cassandra). Setting the value of the property to true makes sure that the Cassandra pods land on the node pool dedicated for Cassandra.
 
nodeSelector:
  requiredForScheduling: true
  apigeeRuntime:
    key: "cloud.google.com/gke-nodepool"
    value: "apigee-runtime"
  apigeeData:
    key: "cloud.google.com/gke-nodepool"
    value: "apigee-data"

If cassandra.nodeSelector.key and cassandra.nodeSelector.value are set, then the values specified by them override the ones specified in nodeSelector.apigeeData.

Taints and Tolerations

You need to make sure that nothing else runs on the node pool dedicated to Cassandra workloads. Otherwise, other pods are likely to get scheduled on that node pool, and when the need arises to schedule Cassandra pods, there might not be enough resources left on the node pool to do so.

If for some reason you set the nodeSelector.requiredForScheduling property, mentioned in the previous point, to false, taints and tolerations can be utilized to ensure that the apigee-data node pool always runs only the Cassandra workloads and nothing else.

Before configuring the tolerations in your overrides file, you will have to taint the nodes which are supposed to run Cassandra:

kubectl taint nodes <node_name> workload=cassandra-datastore:NoSchedule

Once you have tainted the nodes, you can specify the cassandra.tolerations property in your overrides file like below:

 

cassandra:
  ...
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "cassandra-datastore"
    effect: "NoSchedule"
  ...

Scaling Cassandra StatefulSet

As you deploy more and more API Proxies on Apigee and they start to receive a good amount of traffic, it becomes very important to scale your Apigee Hybrid components so that they can serve the larger traffic volume. All Apigee Hybrid components support autoscaling except Cassandra. That is why it is very important to scale Cassandra appropriately, as demand requires.

Apigee Hybrid is configured to run one Cassandra pod per Kubernetes worker node; a podAntiAffinity property directs this behavior. Keep this in mind when scaling Cassandra, as it helps you reserve sufficient capacity for provisioning the new set of Cassandra replicas.

In Apigee Hybrid, Cassandra is not configured for horizontal or vertical pod autoscaling, so the scaling has to be done manually.

Manual Horizontal Scaling

The apigee-data node pool must have additional capacity when scaling up Cassandra horizontally. If there is no additional capacity while performing the horizontal scaling, the new Cassandra pods will go into the Pending state.

The cassandra.replicaCount configuration property should be tweaked to scale the Cassandra pods horizontally, as shown below. The replication factor for Cassandra is three, and scaling (up or down) should be done in multiples of three.

cassandra:
  ...
  replicaCount: "9"
  ...
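After editing the overrides file, the change has to be applied to the cluster. With an apigeectl-based installation, this might look like the sketch below; the file path and the component-scoping flag are assumptions to verify against your installation method:

```shell
# Apply only the datastore (Cassandra) component from the overrides file.
# Path and flags assume an apigeectl-based install; Helm-based installs differ.
apigeectl apply -f overrides/overrides.yaml --datastore
```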

Depending on the traffic load, you might want to scale down the number of Cassandra nodes (essentially, all the Cassandra nodes of a cluster are connected together in the form of a Cassandra Ring). Utmost caution should be taken when scaling down Cassandra because it is more invasive than scaling up. If any node other than the ones being decommissioned is unhealthy, do not proceed with downscaling, because Kubernetes will not be able to downscale the pods in the cluster.

Before proceeding with the downscale operation, determine whether the Cassandra cluster has enough storage to support downscaling (after the scale-down operation, the remaining active Cassandra nodes will see a bump in their respective storage usage). After scaling down, the active Cassandra nodes should have no more than 75% of their storage utilized.

For example, as shown in Table 1 below, if your cluster has six Cassandra nodes and they are all approximately 50% utilized on storage, downscaling to three nodes would result in all three active nodes reaching 100% storage utilization, which would not leave any room for continued operations.

If, however, you have nine Cassandra nodes, all approximately 50% utilized on storage, downscaling to six nodes would leave each remaining node at 75% storage utilization. Because of this 25% headroom, the Cassandra cluster can be scaled down from nine to six nodes.
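As a sanity check before downscaling, the post-downscale utilization can be estimated with a small script like the one below. This is an illustrative helper, not part of Apigee tooling; the 75% threshold mirrors the guidance above, and it assumes data is evenly distributed across nodes:

```python
def post_downscale_utilization(current_nodes, target_nodes, current_utilization):
    """Estimate per-node storage utilization after removing nodes.

    Assumes data is evenly distributed and fully retained, so the
    surviving nodes absorb the decommissioned nodes' share.
    """
    return current_utilization * current_nodes / target_nodes

def safe_to_downscale(current_nodes, target_nodes, current_utilization, threshold=0.75):
    # Remaining nodes should stay at or below 75% storage utilization.
    return post_downscale_utilization(current_nodes, target_nodes, current_utilization) <= threshold

# The two scenarios from Table 1:
print(post_downscale_utilization(6, 3, 0.50))  # 1.0  -> 100% utilized, unsafe
print(post_downscale_utilization(9, 6, 0.50))  # 0.75 -> 75% utilized, acceptable
print(safe_to_downscale(6, 3, 0.50))           # False
print(safe_to_downscale(9, 6, 0.50))           # True
```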

Once the scale down operation is complete, the Persistent Volume Claims and Persistent Volumes associated with Cassandra must be cleaned up manually.
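As an illustration, the orphaned claims can be listed and removed with kubectl. The PVC name pattern, label, and namespace below are assumptions based on a default installation and should be verified against your cluster first:

```shell
# List the Cassandra PVCs in the apigee namespace (label is an assumption).
kubectl get pvc -n apigee -l app=apigee-cassandra

# Delete the claim left behind by a decommissioned pod, e.g. replica index 8;
# deleting the PVC releases its bound PV per the StorageClass reclaim policy.
kubectl delete pvc cassandra-data-apigee-cassandra-default-8 -n apigee
```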

Table 1: Downscaling impact on overall Cassandra Cluster’s storage utilization

| Previous Number of Cassandra Nodes | Updated Number of Cassandra Nodes | Previous Storage Utilization | Updated Storage Utilization |
| --- | --- | --- | --- |
| 6 | 3 | 50% | 100% |
| 9 | 6 | 50% | 75% |

Manual Vertical Scaling

As stated earlier, only one Cassandra pod can be deployed per Kubernetes worker node because of podAntiAffinity. If the Cassandra pods have to be vertically scaled, this can be achieved through replacement. Vertical scaling of Cassandra pods requires additional nodes. The best practice is to create a separate node pool based on the new Cassandra CPU and memory specifications and then deploy the instances there. Below is what the configuration looks like for this process:

nodeSelector:
  requiredForScheduling: true
  apigeeData:
    key: "cloud.google.com/gke-nodepool"
    value: "apigee-data-new"
cassandra:
  resources:
    requests:
      cpu: 14
      memory: 16Gi
 
The configuration above specifies a new node pool called apigee-data-new for the Cassandra pods. Once you apply this new overrides file, the Cassandra pods will start rolling over to the new node pool. Once everything is complete, make sure to clean up the old resources, such as the old node pool.

Storage Expansion for Existing Cassandra Pods

The Apigee Hybrid Cassandra component uses Persistent Volumes to store data. The size of the Persistent Volume is defined during installation and initial configuration and is immutable thereafter. Therefore, any new node added to the cluster will use the same Persistent Volume size.

It is possible to increase the size of the existing Persistent Volume by making the changes directly on the Persistent Volume Claim, but new nodes that get provisioned will use the smaller initial Persistent Volume size.

To overcome this, follow the below-mentioned procedure, which expands the storage capacity of the existing volumes and also allows new nodes to provision their Persistent Volumes at the larger size:

  • Update the storage capacity in the Persistent Volume Claims (PVCs)
  • Back up the StatefulSet as a YAML manifest and then delete it
  • When deleting the StatefulSet, be careful to use the --cascade=orphan option so the pods keep running
  • Update the storage capacity in the backed-up YAML manifest file
  • Re-apply the StatefulSet YAML manifest/configuration using kubectl
  • Update the overrides.yaml file with the new capacity and apply it
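The steps above can be sketched with kubectl as follows; the resource names and the new size (100Gi here) are placeholders for illustration, and PVC expansion requires a StorageClass with allowVolumeExpansion enabled:

```shell
# 1. Expand each Cassandra PVC (repeat per replica; names are placeholders).
kubectl patch pvc cassandra-data-apigee-cassandra-default-0 -n apigee \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# 2. Back up the StatefulSet manifest, then delete it WITHOUT deleting the pods.
kubectl get statefulset apigee-cassandra-default -n apigee -o yaml > cassandra-sts.yaml
kubectl delete statefulset apigee-cassandra-default -n apigee --cascade=orphan

# 3. Edit the storage request in cassandra-sts.yaml, then re-apply it.
kubectl apply -f cassandra-sts.yaml

# 4. Reflect the new capacity in overrides.yaml and apply it with your usual tooling.
```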

You can find the detailed steps for expanding Cassandra Persistent Volumes here.

Highly Available Apigee Hybrid Setup

It’s recommended to run Apigee Hybrid in multiple regions/datacenters (at least two, if not more) to ensure high availability. This makes sure that your API Proxies remain available without any downtime even if one of the regions goes down, provided you have taken care of other configurations like DNS.

While setting up Apigee Hybrid in multiple regions, Cassandra should be configured very carefully because the data has to be replicated between all the Apigee Runtime Instances in different regions/datacenters. If replication is not configured properly between the Cassandra clusters, you might not be able to deploy or access your API Proxies.

In order for the Cassandra clusters in different Kubernetes clusters to connect with each other and start replicating data, you should satisfy the below-mentioned requirements:

  • Have non-overlapping CIDR ranges between the Cassandra pods
  • Have TCP ports 7000 and 7001 whitelisted so that Cassandra pods can communicate with each other
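On Google Cloud, for example, the port requirement could be satisfied with a firewall rule similar to the sketch below; the network name and source ranges are placeholders you would replace with your clusters' pod CIDR ranges:

```shell
# Allow inter-region Cassandra traffic (gossip and TLS inter-node) between
# the pod CIDR ranges of both clusters. Network and ranges are placeholders.
gcloud compute firewall-rules create allow-cassandra-gossip \
  --network=my-vpc \
  --allow=tcp:7000,tcp:7001 \
  --source-ranges=10.0.0.0/14,10.4.0.0/14
```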

You can deploy your Apigee Organization in up to 10 regions/datacenters and have a maximum of 150 Cassandra pods across the same.

The architecture below shows how two Apigee Hybrid Runtime Instances are set up in two regions/datacenters, with only the Cassandra pods communicating with each other over ports 7000/7001. All other components work independently.

 


Diagram 2: Apigee Hybrid Multi-Region/Multi-Datacenter Deployment

You can check the status of the Cassandra clusters configured in two regions by running the below-mentioned command:

kubectl exec apigee-cassandra-default-0 -n apigee -- nodetool -u jmxuser -pw <password> status

Output of the nodetool command is shown below:


Diagram 3: Apigee Cassandra nodetool Output for Multi-Region/Multi-Datacenter

Backup and Restore

Suppose you are running Cassandra in a single region, it goes down because of some issue, and you want to restore it while making sure the existing data is available. Since Cassandra stores data about your Apigee Hybrid Runtime, it is very important to take backups in case you need to restore Cassandra data in the future for various reasons (like failure of the only available region).

Backup for Cassandra is not enabled by default. You can specify the backup configuration and its schedule in the overrides file.

There are two backup storage backends which you can choose from:

  • Google Cloud Storage
  • Remote Server

The below configuration shows Google Cloud Storage as the backend for Cassandra backup:

 

cassandra:
  backup:
    enabled: true
    serviceAccountPath: "./cass-backup-sa.json"
    dbStorageBucket: "gs://cass-backups"
    schedule: "45 23 * * 6"
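Once scheduled backups start running, you can verify that archives are landing in the bucket; the bucket name below is taken from the example configuration above:

```shell
# List backup archives in the configured Cloud Storage bucket.
gsutil ls gs://cass-backups/
```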

Your Cassandra backups will appear like below in Cloud Storage:


Diagram 4: Apigee Cassandra Backups in Google Cloud Storage

Refer to this link to configure a Remote Server as the Cassandra backup storage backend.

You can use backups to restore the Apigee Hybrid Runtime in case of any disaster.

Don’t use backups to recover from the below scenarios:

  • Cassandra node failures
  • Accidental deletion of data like Developers and API Products
  • One or more regions/datacenters going down in a multi-region/multi-datacenter setup

It’s important to note that you will incur some downtime during the restoration process and there will be data loss for the period between the latest successful backup and the time the restoration completes.

When restoring, keep the below-mentioned points in mind:

  • You can only restore the entire Cassandra data; no cherry-picking is allowed
  • Apigee Hybrid version should be the same for the new cluster and the old cluster
  • The number of Cassandra pods in the new cluster should be the same as that of the old cluster

Conclusion

To summarize, we looked at how to schedule Cassandra pods on dedicated node pools and why it’s important to do capacity planning beforehand for reliably running your Cassandra workloads. We also looked at how to handle Cassandra scaling, since, unlike the other Apigee Hybrid components, it doesn’t support horizontal or vertical pod autoscaling. Lastly, we looked at how to configure Cassandra for running Apigee Hybrid in multiple regions/datacenters to achieve high availability.

To go forward from here, it is highly recommended to go through the Apigee Hybrid Cassandra documentation for details on other configurations that might fit your use case and help you better run Apigee Hybrid in your environment.

Stay tuned for further parts in this series, which will cover more Cassandra operations.

Thanks @ncardace and @yuriyl for reviewing the draft of this blog.

Advanced Apigee Hybrid Cassandra Operations Series

Comments
aramkrishna6

@sampriyadarshi  any plan to release part 2 

@aramkrishna6 Yes, it's in draft. Should be released soon.

aramkrishna6

@sampriyadarshi   Appreciate your action on request, if you can add more details on Cassandra ring  in multi cloud and with on prem and Cassandra back up and recovery with CSI (with public cloud ) and without CSI (on premise for example) . In addition to what's documented in public and more details for Cassandra password reset

aramkrishna6

@sampriyadarshi  Please let us know the link for updated article (Part-2)

@aramkrishna6 here is the link of Part 2: https://www.googlecloudcommunity.com/gc/Cloud-Product-Articles/Advanced-Apigee-Hybrid-Cassandra-Oper...

As always, your feedback is appreciated.

Version history
Last update:
06-30-2023 12:26 AM