Best Practices for Cassandra High Availability

Single Region Deployments

Apigee hybrid scales to support high availability deployments across multiple regions, and a multi-region deployment is our recommended pattern. Apigee hybrid can also be scaled down to a single region. If you are running an Apigee hybrid deployment in a single region, this document discusses best practices for maintaining high availability within that region.

This document is intended for Apigee hybrid operators and administrators responsible for managing Cassandra deployments. We discuss the use of the Cassandra snitch feature and how it can mitigate node failure issues in a three-node Cassandra cluster deployed in a single region.

Node Recovery and Performance Considerations

Node recovery processes can have an impact on overall cluster performance. When nodes are taken offline for testing or maintenance, the subsequent recovery can lead to increased latency, reduced throughput and potential strain on other nodes. It's important to be aware of the time required for a node to rejoin the cluster and the potential performance effects during this period.

Exploring configuration options and strategies to manage these performance effects during node recovery is a valuable practice. Note that node recovery time can vary depending on data volume, hardware, network latency, and Cassandra configuration.
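
As one way to track this, the sketch below polls Cassandra's cluster metadata until every node reports as up again after maintenance. It assumes the DataStax Python driver and network access to a Cassandra contact point, which may not be exposed in every Apigee hybrid installation; the contact point address is hypothetical.

```python
# Minimal sketch: wait for all Cassandra nodes to come back up after maintenance.
import time
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # hypothetical contact point
cluster.connect()                # establishes the control connection and metadata

while True:
    # Hosts whose status is not yet known are treated as still down.
    down = [h for h in cluster.metadata.all_hosts() if not h.is_up]
    if not down:
        print("All nodes are back up")
        break
    print("Still waiting on:", [str(h.address) for h in down])
    time.sleep(30)

cluster.shutdown()
```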

Rack Configuration for High Availability

Rack configuration plays a vital role in achieving high availability in Cassandra deployments. In some setups, even with nodes distributed across regions, the loss of a single node can have broader impacts if the rack configuration is not explicitly defined. Cassandra distributes token ranges equally across nodes, so nodes need to be distributed equally across availability zones. In a single-region deployment, an availability zone acts as a virtual rack. Token range distribution is a logical distribution performed by Cassandra based on the topology provided by Kubernetes and communicated by a snitch. Cassandra relies on snitches for this topology information so that replicas are placed on the appropriate individual nodes. Default snitches generally provide adequate topology information, allowing rack designations to align with availability zones.
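
As a quick check of this alignment, the following sketch reads the standard `system.local` and `system.peers` tables to confirm that each node reports the data center and rack, that is, the availability zone, you expect. It assumes the DataStax Python driver and a reachable contact point; the IP address is hypothetical.

```python
# Minimal sketch: count Cassandra nodes per (data center, rack) pair.
from collections import Counter
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # hypothetical contact point
session = cluster.connect()

racks = Counter()
local = session.execute("SELECT data_center, rack FROM system.local").one()
racks[(local.data_center, local.rack)] += 1
for peer in session.execute("SELECT data_center, rack FROM system.peers"):
    racks[(peer.data_center, peer.rack)] += 1

# In a healthy three-node, three-zone deployment each rack should appear once.
for (dc, rack), count in sorted(racks.items()):
    print(f"{dc}/{rack}: {count} node(s)")

cluster.shutdown()
```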

Token Range Management in Distributed Setups

Token ranges define how data is distributed across the nodes in the cluster. Challenges with token range management, particularly during node recovery or failure scenarios, can affect data availability and performance. Each token range has a node that is primarily responsible for its data; that node replicates the data to other nodes so the cluster remains accessible if it goes down. Cassandra allocates these ranges using the topology information provided by snitches.
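
The following self-contained sketch illustrates the idea with a deliberately simplified ring of three nodes in three racks. Real Cassandra uses Murmur3 tokens, virtual nodes, and the keyspace's replication strategy, so this is only a conceptual model, not Cassandra's actual algorithm.

```python
# Simplified token ring: primary ownership plus replicas spread across racks.
from bisect import bisect_left

# (token, node, rack) entries sorted by token; each node owns the range that
# ends at its own token.
ring = [
    (-6_000_000_000_000_000_000, "node-a", "zone-1"),
    (0,                          "node-b", "zone-2"),
    (6_000_000_000_000_000_000,  "node-c", "zone-3"),
]
tokens = [t for t, _, _ in ring]

def replicas(key_token, rf=3):
    """Primary replica = first node whose token is >= key_token (wrapping);
    further replicas are the next nodes in the ring in distinct racks."""
    start = bisect_left(tokens, key_token) % len(ring)
    chosen, racks_seen = [], set()
    for i in range(len(ring)):
        _, node, rack = ring[(start + i) % len(ring)]
        if rack not in racks_seen:
            chosen.append(node)
            racks_seen.add(rack)
        if len(chosen) == rf:
            break
    return chosen

print(replicas(42))  # -> ['node-c', 'node-a', 'node-b']
```

If node-c were to go down, the data in its range would still be readable from node-a and node-b, which is why spreading replicas across racks (availability zones) matters.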

Kubernetes and Cassandra Integration

Key Distinction: Kubernetes Pod Placement vs. Cassandra Data Distribution

There is a fundamental distinction between how Kubernetes manages pod placement and how Cassandra distributes and replicates data. Kubernetes schedules pods based on resource availability and scheduling policies. Cassandra's internal logic for data distribution and replication is governed by its token ring, replication strategy, and rack awareness.

Cassandra's token range management and rack awareness, configured through its snitch mechanism, dictate how data is distributed and replicated across nodes. To achieve optimal data availability and redundancy, it is necessary to align Kubernetes pod placement with Cassandra's topology awareness. This alignment is primarily achieved by configuring the appropriate Cassandra snitch and utilizing Kubernetes features like StatefulSets and pod anti-affinity rules.
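
One way to verify that alignment is to compare where Kubernetes actually scheduled the Cassandra pods against the zone labels of their nodes. The sketch below assumes the official `kubernetes` Python client and uses illustrative namespace and label-selector values (`apigee`, `app=apigee-cassandra`); adjust them to your installation.

```python
# Minimal sketch: print the availability zone each Cassandra pod landed in.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Hypothetical namespace and label selector for the Cassandra pods.
pods = v1.list_namespaced_pod("apigee", label_selector="app=apigee-cassandra")

for pod in pods.items:
    node = v1.read_node(pod.spec.node_name)
    zone = node.metadata.labels.get("topology.kubernetes.io/zone", "unknown")
    print(f"{pod.metadata.name} -> node {pod.spec.node_name} in zone {zone}")
```

Each pod should report a distinct zone, and those zones should match the racks Cassandra reports through its snitch.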

Cassandra Snitches

Topology Awareness for Data Distribution

Cassandra utilizes a feature called "snitches" to understand the network topology of the cluster. While snitches are a core component of Cassandra's architecture, their configuration and implementation are heavily influenced by the deployment environment, particularly in cloud-based or containerized environments like Kubernetes.

Core Functionality of Cassandra Snitches

Snitches are essential for providing Cassandra with information about the topology of the cluster. In a cloud or Kubernetes environment, a snitch must translate the provider's infrastructure into Cassandra's view of the network layout: regions map to data centers, availability zones map to racks, and Kubernetes nodes and pods host the individual Cassandra nodes. Snitches enable Cassandra to route read and write requests to the appropriate nodes based on data locality and replication strategy, minimizing network latency. By understanding the topology, Cassandra can distribute data replicas across different racks or availability zones, ensuring fault tolerance and data availability even if a zone fails.

Cloud providers like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure organize their infrastructure into regions and availability zones. To leverage cloud-specific topology information, Cassandra offers cloud-aware snitches such as `Ec2Snitch` (for AWS), `GoogleCloudSnitch` (for GCP), and `AzureSnitch` (for Azure). These snitches interact with the cloud provider's metadata APIs to automatically discover the region and availability zone of each Cassandra node. While cloud-specific snitches like `GoogleCloudSnitch` are not always immediately necessary, they are a critical tool for resolving performance bottlenecks or addressing node repair delays. This integration allows Cassandra to adapt to the dynamic and distributed nature of cloud environments and to distribute data replicas across availability zones for high availability and fault tolerance.
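
For illustration, the lookup a cloud-aware snitch performs on GCP resembles the following sketch, which queries the instance metadata server for the zone and derives the region from it. `GoogleCloudSnitch` does this internally; the `requests` call here only shows the mechanism, not the snitch's implementation.

```python
# Minimal sketch: derive region (data center) and zone (rack) from GCP metadata.
import requests

resp = requests.get(
    "http://metadata.google.internal/computeMetadata/v1/instance/zone",
    headers={"Metadata-Flavor": "Google"},
    timeout=5,
)
# The response looks like "projects/<project-number>/zones/us-central1-a".
zone = resp.text.rsplit("/", 1)[-1]   # e.g. "us-central1-a"
region = zone.rsplit("-", 1)[0]       # e.g. "us-central1"
print(f"region (data center): {region}, zone (rack): {zone}")
```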
