on 06-30-2023 12:24 AM - edited on 06-30-2023 12:27 AM by greatdevaks
This article is co-authored by @greatdevaks and @sampriyadarshi.
This article is the second part of the Advanced Apigee Hybrid Cassandra Operations Blog Series, which covers the nitty-gritty of Apigee Hybrid Cassandra Lifecycle Operations. We strongly recommend going through Part 1 first. In this article, we will dive deeper into the Cassandra database, look at some of its internals, and walk through some common troubleshooting techniques.
Before we dive deeper, let’s take a look at some of the key terminology:
Now let’s dive a little deeper into the concepts of Replication Factor and Replication Strategies.
Hardware problems or physical link failures can occur at any time in a datacenter, affecting data processing operations. A solution is therefore required that replicates copies of data across multiple Nodes in order to avoid data loss. Data replication is generally performed to ensure that there is no single point of failure in the system.
Cassandra, being a distributed database, places Replicas of data on different Nodes.
Replication Factor of One (1) means that there is only a single copy of data in a Datacenter/Region, while Replication Factor of Three (3) means that there are three copies of data on three different Nodes in a Datacenter/Region.
For Production configurations, a Replication Factor of Three (3) is recommended to ensure that there is no single point of failure. For this configuration to work in Apigee Hybrid, the cassandra.replicaCount property should be set to 3 in the Apigee Hybrid overrides.yaml file.
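For reference, the relevant fragment of the overrides.yaml file would look like the following (an illustrative snippet; the rest of the file is omitted):

```yaml
# Illustrative overrides.yaml fragment; all other required properties omitted.
cassandra:
  replicaCount: 3   # Production recommendation: 3 replicas, ideally one per Availability Zone
```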
The Replication Factor denotes the number of copies of each row of data that Cassandra stores. The Replication Factor is set for each Cassandra Keyspace, and the default Replication Factor is 3 for all the Cassandra Keyspaces. As a result, it is recommended to scale Cassandra Nodes by a factor of three. Ideally, the Cassandra Nodes should be distributed across three Availability Zones of a Region/Datacenter, so that the data is replicated in all those 3 Availability Zones.
Pro Tip: The Replication Factor should not exceed the number of Nodes in a Cluster.
Every Cassandra Node in a Cluster is assigned one or more token range(s) for data in a continuous ring form. For example, if there are three Cassandra Nodes in a Cluster and the token range (hypothetical) is 0-59, the token range assignment for the Cassandra Nodes would look like:
When a request to write data is made, Cassandra hashes the partition key to get a token value and places the data on the Cassandra Node whose token range contains that token.
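Conceptually, this token-based placement can be sketched in a few lines of Python. The sketch below is purely illustrative and uses the hypothetical 0-59 token range from the example above, with a toy hash function; real Cassandra uses the Murmur3 partitioner over a 64-bit token space.

```python
# Illustrative sketch of Cassandra-style token placement.
# Hypothetical token space 0-59 split across three Nodes, as in the
# example above; NOT the real Murmur3 partitioner.

TOKEN_SPACE = 60

# Each Node owns the range of tokens up to (and including) its own token.
NODE_TOKENS = {"node-1": 19, "node-2": 39, "node-3": 59}

def token_for(partition_key: str) -> int:
    """Hash the partition key into the hypothetical token space."""
    return sum(ord(c) for c in partition_key) % TOKEN_SPACE  # toy hash

def owner_of(token: int) -> str:
    """Return the Node whose token range contains the given token."""
    for node, upper in sorted(NODE_TOKENS.items(), key=lambda kv: kv[1]):
        if token <= upper:
            return node
    # Wrap around the ring back to the first Node.
    return min(NODE_TOKENS, key=NODE_TOKENS.get)

key = "developer-app-1234"  # hypothetical partition key
t = token_for(key)
print(f"key={key!r} token={t} owner={owner_of(t)}")
```
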
When placing data on the Cassandra Nodes, Snitches and Replication Strategies are used.
There are two kinds of Replication Strategies in Cassandra, as explained below.
SimpleStrategy is used when there is just one datacenter with one rack. It relies on SimpleSnitch, which returns a list of all the Cassandra Nodes in a Cassandra Ring. When data is written, SimpleStrategy places the first Replica on the Cassandra Node whose token range contains the token value. The remaining Replicas (up to the defined Replication Factor) are then placed in a clockwise direction around the Cassandra Ring, with the Cassandra Node holding the next higher token range chosen on each placement attempt.
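The clockwise placement can be sketched as follows. This is an illustrative Python sketch under the simplifying assumption of one token per Node; the Node names are hypothetical.

```python
# Illustrative sketch of SimpleStrategy replica placement: the first
# Replica goes on the Node owning the token's range, and each remaining
# Replica goes on the next Node clockwise around the Ring.

RING = ["node-1", "node-2", "node-3", "node-4"]  # Nodes in ring (token) order

def simple_strategy_replicas(owner: str, replication_factor: int) -> list:
    """Return the Nodes holding Replicas, walking clockwise from the owner."""
    start = RING.index(owner)
    return [RING[(start + i) % len(RING)] for i in range(replication_factor)]

print(simple_strategy_replicas("node-3", 3))  # ['node-3', 'node-4', 'node-1']
```
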
NetworkTopologyStrategy is a more complex rack-aware replication strategy that tries to avoid placement of two Replicas on the same rack in a datacenter. It uses PropertyFileSnitch which maintains information about which Cassandra Node belongs to which datacenter and rack. It places Replicas of each key/value pair on multiple Nodes, ensuring that there is at least one Replica of each key/value pair in each datacenter.
Apigee Hybrid uses NetworkTopologyStrategy.
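The rack-aware placement described above can be sketched conceptually. The following is an illustrative Python sketch, not Cassandra's actual implementation; the Node, Datacenter, and rack names are hypothetical.

```python
# Illustrative sketch of NetworkTopologyStrategy-style placement: walk the
# Ring and prefer Nodes on racks that do not yet hold a Replica in that
# Datacenter, so two Replicas avoid sharing a rack where possible.

RING = [  # (node, datacenter, rack) in token order; hypothetical topology
    ("n1", "dc1", "rack1"), ("n2", "dc2", "rack1"), ("n3", "dc1", "rack2"),
    ("n4", "dc2", "rack2"), ("n5", "dc1", "rack3"), ("n6", "dc2", "rack3"),
]

def nts_replicas(rf_per_dc: dict) -> dict:
    """Pick rf_per_dc[dc] Replica Nodes per Datacenter, rack-aware."""
    chosen = {dc: [] for dc in rf_per_dc}
    racks_used = {dc: set() for dc in rf_per_dc}
    skipped = {dc: [] for dc in rf_per_dc}
    for node, dc, rack in RING:
        if dc not in rf_per_dc or len(chosen[dc]) >= rf_per_dc[dc]:
            continue
        if rack in racks_used[dc]:
            skipped[dc].append(node)  # revisit only if distinct racks run out
            continue
        chosen[dc].append(node)
        racks_used[dc].add(rack)
    for dc, need in rf_per_dc.items():  # fill from skipped Nodes if short
        chosen[dc].extend(skipped[dc][: need - len(chosen[dc])])
    return chosen

print(nts_replicas({"dc1": 3, "dc2": 3}))
```

Note how every Datacenter in the replication map ends up with at least one Replica, which is the property the article calls out above.
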
If a SELECT query is triggered on the system_schema Keyspace, the Replication Factor of the Apigee Hybrid Keyspaces can be seen set to 3, with NetworkTopologyStrategy in use; an example query and its output appear in the cqlsh section later in this article.
It is essential to ensure that Apigee Hybrid Cassandra runs smoothly and efficiently all the time. Apigee Hybrid Cassandra, being a complex component, may encounter issues. Identifying the root causes for such issues can be a complex and time-consuming process.
Now that we understand how data gets replicated in Cassandra, let’s check out how to troubleshoot some of the common issues in Cassandra.
When troubleshooting a Cassandra issue, it is important to have a clear understanding of the Cluster's configuration and how it is being used. It is also important to have a good understanding of the Cassandra logs, which can provide valuable information about the cause of the issue.
The following are some of the most commonly observed issues in Apigee Hybrid Cassandra:
The sub-sections below describe, with the help of examples, some of the most common troubleshooting techniques which can help troubleshoot Cassandra issues.
Use-case reference: Official Documentation
Let’s assume that you had a dual-region/datacenter Apigee Hybrid setup which you modified to a single-region/datacenter setup for some business or technical reason. Now you again want a dual-region/datacenter setup to establish proper HA/DR for Apigee Hybrid, so you are trying to expand your Apigee Hybrid setup from a single region/datacenter to multiple regions/datacenters. While performing the region/datacenter expansion, you see that the Cassandra Pods in the new region/datacenter are not coming up and are in a CrashLoopBackOff state. You want to find the root cause of the issue and fix it as soon as possible.
How would you proceed? Check out the troubleshooting flow described below.
The following command can be used to check the Cassandra logs in case the Cassandra Pods are not in a healthy state or issues with Cassandra are suspected.
kubectl logs -n apigee -l app=apigee-cassandra -f
Suppose that, after running the above-mentioned command, the Cassandra Pods turned out to be reporting the following error.
Exception (java.lang.RuntimeException) encountered during startup:
A node with address 10.52.18.40 already exists, cancelling join.
Use cassandra.replace_address if you want to replace this node.
The next step should be to gain access to the Cassandra Cluster configuration and perform further debugging.
For collecting the logs from all the Namespaces of the Apigee Hybrid Runtime Cluster, the following command can be run.
kubectl cluster-info dump --output-directory logs_<directory_name> --all-namespaces --output yaml
nodetool is a very useful utility which comes bundled with the Cassandra installation. It helps identify issues at Cassandra Node level and gives a lot of insights into the state of the Cassandra process itself.
Some of the most commonly used nodetool sub-commands from Apigee Hybrid Cassandra perspective are stated below.
Use the below-mentioned command to check the status of the Cassandra Nodes.
# check cassandra cluster status
kubectl -n apigee get pods \
-l app=apigee-cassandra \
--field-selector=status.phase=Running \
-o custom-columns=name:metadata.name --no-headers \
| xargs -I{} sh -c "echo {}; kubectl -n apigee exec {} -- nodetool -u <username> -pw <password> status"
Let’s say that the above-mentioned command returned a status showing Nodes from the previously deleted Secondary Datacenter/Region still listed in the Cassandra Cluster. These stale records are what cause the issue.
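When scanning nodetool status output for such stale entries, Nodes from a removed Datacenter typically show up in a non-healthy state such as DN (Down/Normal) rather than UN (Up/Normal). The following is a minimal, illustrative Python sketch that flags such Nodes; the sample output is hypothetical, loosely based on the address from the error above.

```python
# Illustrative parser for `nodetool status` output that flags Nodes not in
# the healthy UN (Up/Normal) state; the sample output below is hypothetical.

SAMPLE_STATUS = """\
Datacenter: dc-1
================
Status=Up/Down | State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns  Host ID   Rack
UN  10.50.112.194  1.45 MiB  256     ?     aaaa-...  ra-1
Datacenter: dc-2
================
Status=Up/Down | State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns  Host ID   Rack
DN  10.52.18.40    1.30 MiB  256     ?     bbbb-...  ra-1
"""

def unhealthy_nodes(status_output: str) -> list:
    """Return (state, address) for every Node line not reporting UN."""
    flagged = []
    for line in status_output.splitlines():
        parts = line.split()
        # Node lines start with a two-letter state code such as UN or DN.
        if parts and parts[0] in {"UN", "DN", "UL", "DL", "UJ", "DJ", "UM", "DM"}:
            if parts[0] != "UN":
                flagged.append((parts[0], parts[1]))
    return flagged

print(unhealthy_nodes(SAMPLE_STATUS))  # [('DN', '10.52.18.40')]
```
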
This section explains the use of cqlsh to debug issues with Cassandra. cqlsh can be used to query Cassandra Tables to extract useful information.
A client container can be used to run cqlsh commands for debugging. The steps to create a client container are described below.
The client container uses the TLS Certificate from apigee-cassandra-user-setup Pod. In order to get the exact certificate name, the following command should be run.
kubectl get secrets -n apigee --field-selector type=kubernetes.io/tls | grep apigee-cassandra-user-setup | awk '{print $1}'
Create a file named, say, cassandra-client.yaml to store the following cqlsh Pod specifications.
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: my-cassandra-client # For example: my-cassandra-client
  name: my-cassandra-client
  namespace: apigee
spec:
  containers:
  - name: my-cassandra-client
    image: "gcr.io/apigee-release/hybrid/apigee-hybrid-cassandra-client:1.9.3" # For example, 1.9.3.
    imagePullPolicy: Always
    command:
    - sleep
    - "3600"
    env:
    - name: CASSANDRA_SEEDS
      value: apigee-cassandra-default.apigee.svc.cluster.local
    - name: APIGEE_DML_USER
      valueFrom:
        secretKeyRef:
          key: dml.user
          name: apigee-datastore-default-creds
    - name: APIGEE_DML_PASSWORD
      valueFrom:
        secretKeyRef:
          key: dml.password
          name: apigee-datastore-default-creds
    volumeMounts:
    - mountPath: /opt/apigee/ssl
      name: tls-volume
      readOnly: true
  volumes:
  - name: tls-volume
    secret:
      defaultMode: 420
      secretName: apigee-cassandra-user-setup-rg-hybrid-b7d3b9c-tls # For example: apigee-cassandra-user-setup-rg-hybrid-b7d3b9c-tls
  restartPolicy: Never
Apply the Pod Specifications to the target Kubernetes Cluster which is hosting the Apigee Hybrid Runtime Plane components.
kubectl apply -f cassandra-client.yaml -n apigee
Exec into the client container in order to perform the debugging.
kubectl exec -n apigee my-cassandra-client -it -- bash
Connect to the Cassandra cqlsh interface with the following command.
cqlsh ${CASSANDRA_SEEDS} -u ${APIGEE_DML_USER} -p ${APIGEE_DML_PASSWORD} --ssl
Once the connection to cqlsh has been made, queries can be triggered for performing the desired actions.
For the above-described hypothetical use-case, the following commands can be triggered for debugging and resolving the issue.
Trigger the below-mentioned query to get the Keyspace definitions.
select * from system_schema.keyspaces;
Let’s say the query resulted in the following output.
bash-4.4# cqlsh 10.50.112.194 -u <username> -p <password> --ssl
Connected to apigeecluster at 10.50.112.194:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
admin_user@cqlsh> Select * from system_schema.keyspaces;
keyspace_name | durable_writes | replication
-------------------------------------+----------------+--------------------------------------------------------------------------------------------------
system_auth | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
kvm_tsg1_apigee_hybrid_prod_hybrid | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
kms_tsg1_apigee_hybrid_prod_hybrid | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
system_distributed | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
perses | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
cache_tsg1_apigee_hybrid_prod_hybrid | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
rtc_tsg1_apigee_hybrid_prod_hybrid | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
quota_tsg1_apigee_hybrid_prod_hybrid | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system_traces | True | {'Primary-DC1': '3', 'Secondary-DC2': '3', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
(11 rows)
It can be seen from the output that references to Secondary-DC2, i.e., stale records, are still present; these need to be removed in order to bring the setup to a clean state. The remaining cleanup process is described in the official documentation.
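For illustration, removing a stale Datacenter reference from a Keyspace generally involves re-declaring its replication map without the deleted Datacenter, along the lines of the following example (the Keyspace and Datacenter names are taken from the hypothetical output above); always follow the official documentation for the complete, supported cleanup procedure.

```sql
-- Illustrative: drop the deleted Secondary-DC2 from a Keyspace's
-- replication map by re-declaring it with only the surviving Datacenter.
ALTER KEYSPACE system_auth
  WITH replication = {'class': 'NetworkTopologyStrategy', 'Primary-DC1': '3'};
```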
cqlsh can further be used to understand how data is stored in Cassandra for Apigee Hybrid.
One can describe the Keyspaces and Tables (for example, with the DESCRIBE KEYSPACES and DESCRIBE TABLES commands), check the schema of a Table (with DESCRIBE TABLE), and even look at the data inside a Table by using a SELECT statement, for example against the Tables corresponding to KVMs.
Use-case reference: Official Documentation
Let’s assume that you were performing some Apigee Hybrid maintenance activity and deleted the Cassandra workload. When you attempt to redeploy the Cassandra workload, the Cassandra Pods end up in a CrashLoopBackOff state.
The Cassandra logs report some issue with the snitch’s datacenter differing from the previous datacenter.
Cannot start node if snitch's data center (us-east1) differs from previous data center
A potential root cause could be stale PVCs present in the Cluster that are still being referenced by the Cassandra Pods. Check the official documentation for more details and for how to resolve the issue.
In this article, we looked at some of the internals of Cassandra and dived deeper into Apigee Cassandra troubleshooting techniques. These techniques are good for getting started; there are more advanced and involved techniques which we will cover in future parts of this series. The article took two examples and highlighted where knowledge of Cassandra internals can be applied.
The next part of this Advanced Apigee Hybrid Cassandra Operations Blog Series will cover the CSI Backup/Restore Procedure for Apigee Hybrid Cassandra. Stay tuned 🙂
We used to run nodetool repair -pr on our on-prem Cassandra clusters, but I don't see any mention of that anti-entropy maintenance task in the Apigee Hybrid administration documentation. Is it required on the Apigee Hybrid Cassandra Nodes (we are using Apigee Hybrid v1.12)?
@sampriyadarshi can you please help me with above?
Hi @sampriyadarshi, in the above you mentioned a username and password. I tried everything, but it says incorrect username and password. What exactly should I specify?