Apigee hybrid relies on Cassandra for storing critical runtime data. Ensuring the health and proper status of Cassandra nodes is therefore paramount for the stability and performance of your API platform. This article provides a comprehensive guide to setting up Cassandra monitoring in an Apigee hybrid environment using Prometheus, based on a practical Kubernetes-based approach.
This guide details the process of monitoring Cassandra nodes status within Apigee hybrid deployments and exporting the collected metrics to Prometheus for visualization and alerting.
This solution serves as a reference implementation for creating custom metrics based on hybrid runtime pods and visualizing them with Prometheus/Grafana.
You can find the material for this article in the following git repository.
Apigee hybrid collects operational metrics that you can use to monitor the health of hybrid runtime plane services and API traffic.
Apigee employs OpenTelemetry for metrics collection within its architecture. Apigee hybrid subsequently sends the collected metrics data to Cloud Observability. At this point, you can utilize the Cloud Monitoring console for viewing, searching, analyzing metrics, and managing alerts.
This built-in metrics collection offers a foundational layer for monitoring the Apigee hybrid platform itself; the following sections detail how to augment it with monitoring based on custom metrics.
For an exhaustive list of Cassandra metrics related to Apigee hybrid, you can read the Apigee hybrid documentation: Viewing metrics/Cassandra metrics.
The monitoring setup is implemented directly on the Kubernetes cluster where the Apigee hybrid runtime is deployed. It involves deploying a custom script, configuring RBAC, managing persistent storage, and setting up Prometheus scraping.
This section outlines a method to monitor Cassandra replication status within Apigee hybrid deployments by deploying a custom monitoring solution on the Kubernetes cluster.
The collected metrics are then exported to Prometheus for enhanced visualization and alerting capabilities.
The first step involves deploying a script to check Cassandra status and export metrics. The provided Bash script, check_cassandra_replication.sh, serves this purpose.
This script connects to Cassandra using nodetool within an Apigee Cassandra pod. It then retrieves Cassandra node status, determining whether nodes are UP or DOWN.
The script calculates the number of UP and DOWN nodes, and it has the option to perform Cassandra repairs if necessary, utilizing the nodetool repair command. Finally, the script exports the collected metrics in Prometheus format.
The script's functionality relies on several key elements.
Environment variables such as NAMESPACE, POD_NAME, JMX_USER, and JMX_PASSWORD configure the script's operation. A more secure approach would be to use Kubernetes Secrets instead of environment variables to set JMX_USER and JMX_PASSWORD (a sketch of this approach is shown after the CronJob deployment below).
Temporary files are used to store the output of nodetool commands. Command-line arguments, specifically --repair and --prometheus, control the script's behavior. The script analyzes the output of the nodetool status command to determine the health of Cassandra nodes.
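For reference, here is an abridged, illustrative example of nodetool status output (addresses, loads, and host IDs will differ in your cluster). The first column combines the node status (U=Up, D=Down) with its state (N=Normal, L=Leaving, J=Joining, M=Moving), which is exactly what the script's grep patterns match:

Datacenter: dc-1
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID       Rack
UN  10.48.0.20   560.32 KiB  256     100.0%            0123abcd-...  ra-1
UN  10.48.1.15   559.78 KiB  256     100.0%            4567efab-...  ra-1
UN  10.48.2.9    561.02 KiB  256     100.0%            89abcdef-...  ra-1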
Metrics are exported in the Prometheus exposition format, which includes metrics such as cassandra_nodes_total (total number of Cassandra nodes), cassandra_nodes_up (number of active nodes), cassandra_nodes_down (number of offline nodes), and cassandra_repair_status (status of the last repair operation).
Here is an example of the Prometheus metrics file content that is created (illustrative values for a healthy 3-node cluster where no repair was run):
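# HELP cassandra_nodes_total Total number of Cassandra nodes in the cluster
# TYPE cassandra_nodes_total gauge
cassandra_nodes_total 3
# HELP cassandra_nodes_up Number of Cassandra nodes currently UP
# TYPE cassandra_nodes_up gauge
cassandra_nodes_up 3
# HELP cassandra_nodes_down Number of Cassandra nodes currently DOWN
# TYPE cassandra_nodes_down gauge
cassandra_nodes_down 0
# HELP cassandra_repair_status Status of the last repair operation (1=success, 0=failure, -1=not run)
# TYPE cassandra_repair_status gauge
cassandra_repair_status -1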
Here is the shell script:
#!/bin/bash

# Variables (adjust if they are not set in the environment)
NAMESPACE="apigee"
POD_NAME="apigee-cassandra-default-0"
JMX_USER="${APIGEE_JMX_USER:-ddl_user}"             # Default JMX user; replace if needed
JMX_PASSWORD="${APIGEE_JMX_PASSWORD:-iloveapis123}" # Default JMX password; replace if needed

# Temporary files to store outputs
TEMP_STATUS_FILE="/tmp/nodetool_status.txt"
TEMP_REPAIR_FILE="/tmp/nodetool_repair.txt"
PROMETHEUS_FILE="${PROMETHEUS_FILE:-/metrics-data/cassandra_replication.prom}" # Configurable via env var

# Options for forcing repair or exporting to Prometheus
FORCE_REPAIR=false
EXPORT_PROMETHEUS=false
for arg in "$@"; do
  if [ "$arg" == "--repair" ]; then
    FORCE_REPAIR=true
  elif [ "$arg" == "--prometheus" ]; then
    EXPORT_PROMETHEUS=true
  fi
done

# Execute nodetool status
echo "Retrieving Cassandra node status..."
kubectl exec "$POD_NAME" -n "$NAMESPACE" -- nodetool -u "$JMX_USER" -pw "$JMX_PASSWORD" status > "$TEMP_STATUS_FILE" 2>/tmp/nodetool_error.log

# Check if the command succeeded
if [ $? -ne 0 ]; then
  echo "Error executing nodetool status:"
  cat /tmp/nodetool_error.log
  exit 1
fi

# Analyze the output to count UP and DOWN nodes
TOTAL_NODES=$(grep -E '^[UD][NLJM]' "$TEMP_STATUS_FILE" | wc -l)
UP_NODES=$(grep -E '^UN' "$TEMP_STATUS_FILE" | wc -l)
DOWN_NODES=$(grep -E '^DN' "$TEMP_STATUS_FILE" | wc -l)

# Display the results
echo "Supervision results:"
echo "Total number of nodes: $TOTAL_NODES"
echo "UP nodes (active): $UP_NODES"
echo "DOWN nodes (offline): $DOWN_NODES"

# Initialize repair status (-1 means not run)
REPAIR_STATUS=-1

# Check the status and decide if repair is needed
REPAIR_NEEDED=false
if [ "$DOWN_NODES" -gt 0 ]; then
  echo "ALERT: $DOWN_NODES node(s) are DOWN. Replication may be compromised."
  echo "Details of DOWN nodes:"
  grep -E '^DN' "$TEMP_STATUS_FILE"
  REPAIR_NEEDED=true
elif [ "$FORCE_REPAIR" = true ]; then
  echo "Forced repair requested via --repair."
  REPAIR_NEEDED=true
else
  echo "All nodes are UP. Replication appears to be effective."
fi

# Execute nodetool repair if necessary
if [ "$REPAIR_NEEDED" = true ]; then
  echo "Running nodetool repair to synchronize data..."
  kubectl exec "$POD_NAME" -n "$NAMESPACE" -- nodetool -u "$JMX_USER" -pw "$JMX_PASSWORD" repair -full > "$TEMP_REPAIR_FILE" 2>>/tmp/nodetool_error.log
  # Check if the repair succeeded
  if [ $? -eq 0 ]; then
    echo "Repair completed successfully. Check details in $TEMP_REPAIR_FILE if needed."
    REPAIR_STATUS=1
  else
    echo "Repair failed:"
    cat /tmp/nodetool_error.log
    REPAIR_STATUS=0
    exit 1
  fi
fi

# Export to Prometheus if requested
if [ "$EXPORT_PROMETHEUS" = true ]; then
  echo "Exporting metrics to Prometheus file: $PROMETHEUS_FILE"
  mkdir -p "$(dirname "$PROMETHEUS_FILE")" # Ensure the directory exists (should be writable via volume)
  cat <<EOF > "$PROMETHEUS_FILE"
# HELP cassandra_nodes_total Total number of Cassandra nodes in the cluster
# TYPE cassandra_nodes_total gauge
cassandra_nodes_total $TOTAL_NODES
# HELP cassandra_nodes_up Number of Cassandra nodes currently UP
# TYPE cassandra_nodes_up gauge
cassandra_nodes_up $UP_NODES
# HELP cassandra_nodes_down Number of Cassandra nodes currently DOWN
# TYPE cassandra_nodes_down gauge
cassandra_nodes_down $DOWN_NODES
# HELP cassandra_repair_status Status of the last repair operation (1=success, 0=failure, -1=not run)
# TYPE cassandra_repair_status gauge
cassandra_repair_status $REPAIR_STATUS
EOF
  if [ $? -eq 0 ]; then
    echo "Metrics successfully exported to $PROMETHEUS_FILE"
  else
    echo "Failed to export metrics to $PROMETHEUS_FILE"
    exit 1
  fi
fi

# Clean up temporary files
rm -f "$TEMP_STATUS_FILE" "$TEMP_REPAIR_FILE" /tmp/nodetool_error.log

# Signal success and terminate pod
echo "Metrics written successfully, terminating pod..."
exit 0
Accompanying this script is a Dockerfile, which is used to containerize the script for deployment. This Dockerfile is based on Ubuntu (20.04).
It installs kubectl and other necessary tools. The check_cassandra_replication.sh script is copied into the container.
Finally, the script is set as the container's entrypoint.
FROM ubuntu:20.04
# Install kubectl and basic tools
RUN apt-get update && apt-get install -y curl bash && \
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && \
    chmod +x kubectl && mv kubectl /usr/local/bin/
# Copy the script
COPY check_cassandra_replication.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/check_cassandra_replication.sh
# Set entrypoint
ENTRYPOINT ["/usr/local/bin/check_cassandra_replication.sh"]
This step involves building and pushing the Docker image. The Docker image is constructed using the Dockerfile. It is then tagged and pushed to a container registry, which could be Docker Hub or a private container registry.
For example, when using Google Cloud Artifact Registry, the following commands can be used:
docker build -t <prefix>/cassandra-monitor:latest .
docker tag <prefix>/cassandra-monitor <region>-docker.pkg.dev/<project_id>/images-repo/<prefix>/cassandra-monitor
docker push <region>-docker.pkg.dev/<project_id>/images-repo/<prefix>/cassandra-monitor
Note: replace <prefix>, <region>, and <project_id> with the values for your container registry and Google Cloud project.
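When pushing to Artifact Registry, you may also need to configure Docker authentication for the registry host first, for example:

gcloud auth configure-docker <region>-docker.pkg.dev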
This step presents how to set up Role-Based Access Control (RBAC). RBAC is configured to grant the monitoring pod the necessary permissions to interact with the Kubernetes API and execute commands within the Apigee Cassandra pods.
A ServiceAccount named cassandra-monitor-sa is created in the apigee namespace.
cat << 'EOF' > cassandra-monitor-sa.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cassandra-monitor-sa
  namespace: apigee
EOF
kubectl apply -f cassandra-monitor-sa.yaml
Subsequently, a Role named cassandra-monitor-role is defined in the apigee namespace.
cat << 'EOF' > cassandra-monitor-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cassandra-monitor-role
  namespace: apigee
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
EOF
kubectl apply -f cassandra-monitor-role.yaml
This role grants permissions to get and list pods, as well as create pods/exec, which is required to execute commands within pods.
Finally, a RoleBinding named cassandra-monitor-rolebinding is created in the apigee namespace.
cat << 'EOF' > cassandra-monitor-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cassandra-monitor-rolebinding
  namespace: apigee
subjects:
- kind: ServiceAccount
  name: cassandra-monitor-sa
  namespace: apigee
roleRef:
  kind: Role
  name: cassandra-monitor-role
  apiGroup: rbac.authorization.k8s.io
EOF
kubectl apply -f cassandra-monitor-rolebinding.yaml
This binds the cassandra-monitor-sa ServiceAccount to the cassandra-monitor-role, thus granting the defined permissions.
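You can optionally verify that the permissions are effective by impersonating the ServiceAccount with kubectl auth can-i:

kubectl auth can-i list pods -n apigee \
  --as=system:serviceaccount:apigee:cassandra-monitor-sa
kubectl auth can-i create pods/exec -n apigee \
  --as=system:serviceaccount:apigee:cassandra-monitor-sa

Both commands should print yes.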
This step involves setting up persistent storage. Persistent storage is necessary to store the exported Prometheus metrics.
A PersistentVolumeClaim (PVC) named cassandra-metrics-pvc is created in the apigee namespace.
This PVC requests 1Gi of storage with ReadWriteOnce access mode. The storageClassName is set to standard, although this should be adjusted as needed for your specific environment. It is crucial to verify that the PVC is successfully bound to a Persistent Volume.
cat << 'EOF' > cassandra-metrics-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cassandra-metrics-pvc
  namespace: apigee
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard
EOF
kubectl apply -f cassandra-metrics-pvc.yaml
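To confirm that the PVC is bound, as mentioned above:

kubectl get pvc cassandra-metrics-pvc -n apigee
# The STATUS column should show Bound. With a WaitForFirstConsumer storage
# class, binding only happens once the first pod mounts the volume.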
A CronJob is deployed to schedule the execution of the Cassandra monitoring script. The cassandra-monitor-cronjob.yaml file defines this CronJob. It specifies the schedule for running the job.
The job uses the Docker image built in the second step. It mounts the PersistentVolumeClaim created in the fourth step to store the metrics. It also uses the ServiceAccount created in the third step for RBAC permissions.
The CronJob will then create pods according to the defined schedule, and each pod will execute the monitoring script.
cat << 'EOF' > cassandra-monitor-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cassandra-monitor
  namespace: apigee
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 1 # Keep only the latest successful job
  failedJobsHistoryLimit: 1 # Keep only the latest failed job
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: cassandra-monitor
        spec:
          serviceAccountName: cassandra-monitor-sa
          containers:
          - name: cassandra-monitor
            image: <region>-docker.pkg.dev/<project_id>/images-repo/joelgauci/cassandra-monitor:latest
            args: ["--prometheus"]
            env:
            - name: APIGEE_JMX_USER
              value: "<jmx_user>"
            - name: APIGEE_JMX_PASSWORD
              value: "<jmx_password>"
            - name: PROMETHEUS_FILE
              value: "/metrics-data/cassandra_replication.prom" # Explicitly set the path
            volumeMounts:
            - name: metrics-volume
              mountPath: /metrics-data
          - name: node-exporter
            image: prom/node-exporter:latest
            args:
            - "--collector.textfile.directory=/metrics-data"
            - "--web.listen-address=:9100"
            ports:
            - containerPort: 9100
              name: metrics
            volumeMounts:
            - name: metrics-volume
              mountPath: /metrics-data
          restartPolicy: OnFailure
          volumes:
          - name: metrics-volume
            persistentVolumeClaim:
              claimName: cassandra-metrics-pvc
EOF
kubectl apply -f cassandra-monitor-cronjob.yaml
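Rather than waiting for the next scheduled run, you can trigger a one-off job from the CronJob to test the setup immediately:

kubectl create job --from=cronjob/cassandra-monitor cassandra-monitor-manual -n apigee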
Note: replace <region>, <project_id>, <jmx_user>, and <jmx_password> in the manifest with the values for your environment before applying it.
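As mentioned earlier, a more secure alternative to plain environment variables is to store the JMX credentials in a Kubernetes Secret. Here is a minimal sketch (the Secret and key names are example choices):

kubectl create secret generic cassandra-jmx-credentials -n apigee \
  --from-literal=jmx-user=<jmx_user> \
  --from-literal=jmx-password=<jmx_password>

Then, in the cassandra-monitor container of the CronJob, replace the two value entries with secretKeyRef references:

env:
- name: APIGEE_JMX_USER
  valueFrom:
    secretKeyRef:
      name: cassandra-jmx-credentials
      key: jmx-user
- name: APIGEE_JMX_PASSWORD
  valueFrom:
    secretKeyRef:
      name: cassandra-jmx-credentials
      key: jmx-password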
After the CronJob runs, it's important to verify that the monitoring pod is created and functioning correctly. Verification steps include using kubectl get pods -n apigee to check the pod status.
You should expect the pod to show 1/2 in the READY column, which indicates that the node-exporter sidecar container is still running while the cassandra-monitor container has completed its job.
Additionally, you can use kubectl logs -n apigee <cassandra-monitor-pod-name> -c cassandra-monitor to inspect the logs of the cassandra-monitor container.
The logs should show a successful execution of the script and the export of the metrics (in this example, a cluster of 3 Cassandra nodes, all of them UP):
Retrieving Cassandra node status...
Supervision results:
Total number of nodes: 3
UP nodes (active): 3
DOWN nodes (offline): 0
All nodes are UP. Replication appears to be effective.
Exporting metrics to Prometheus file: /metrics-data/cassandra_replication.prom
Metrics successfully exported to /metrics-data/cassandra_replication.prom
Metrics written successfully, terminating pod...
This step involves setting up Prometheus scraping.
To enable Prometheus to collect the Cassandra metrics, it is necessary to define a Service and a ServiceMonitor.
A Service named cassandra-monitor-service is created in the apigee namespace.
cat << 'EOF' > cassandra-monitor-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: cassandra-monitor-service
  namespace: apigee
  labels:
    app: cassandra-monitor
spec:
  selector:
    app: cassandra-monitor
  ports:
  - name: metrics
    port: 9100
    targetPort: 9100
EOF
kubectl apply -f cassandra-monitor-service.yaml
This service selects pods with the label app: cassandra-monitor. It exposes port 9100, which is the port where the metrics are exposed by the node-exporter sidecar.
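Before wiring up Prometheus, you can check that the sidecar actually serves the metrics by port-forwarding to a running monitoring pod (replace the pod name with one listed by kubectl get pods -n apigee):

kubectl port-forward -n apigee <cassandra-monitor-pod-name> 9100:9100 &
curl -s http://localhost:9100/metrics | grep '^cassandra_'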
If Prometheus is not already installed in your cluster, you can use helm to install the kube-prometheus-stack. This involves adding the prometheus-community Helm repository, updating the repository, creating a monitoring namespace, and then installing the chart.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring
helm list -n monitoring
A ServiceMonitor named cassandra-monitor is created in the monitoring namespace. This ServiceMonitor instructs Prometheus on how to discover and scrape metrics from the cassandra-monitor-service.
It uses a selector to match the service's labels. It specifies the port, the scraping interval, and the scheme. It also uses a namespaceSelector to target the apigee namespace where the service is located.
cat << 'EOF' > cassandra-monitor-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cassandra-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: cassandra-monitor
  endpoints:
  - port: metrics
    interval: 30s
    scheme: http
  namespaceSelector:
    matchNames:
    - apigee
EOF
kubectl apply -f cassandra-monitor-servicemonitor.yaml
You can verify the end-to-end setup in a few steps.
Prometheus Web UI access can be achieved via port forwarding.
The command kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 forwards the Prometheus UI port to your local machine.
You can then open a web browser to http://localhost:9090/ to access the Prometheus web interface. Within the Prometheus UI, PromQL (Prometheus Query Language) can be used to query the exported Cassandra metrics, such as cassandra_nodes_up, cassandra_nodes_down, cassandra_nodes_total, and cassandra_repair_status.
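For example, the following PromQL expressions can be typed into the query box (the ratio assumes cassandra_nodes_total is non-zero):

# Number of nodes currently reported DOWN
cassandra_nodes_down
# Fraction of the cluster that is UP
cassandra_nodes_up / cassandra_nodes_total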
Creating graphs and dashboards in Prometheus allows for the visualization of these metrics over time, enabling effective monitoring and alerting.
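For alerting, here is a minimal PrometheusRule sketch; it assumes the kube-prometheus-stack installation above, whose Prometheus instance selects rules labeled release: prometheus by default (the alert names and thresholds are example choices):

cat << 'EOF' > cassandra-monitor-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cassandra-monitor-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: cassandra-monitor
    rules:
    - alert: CassandraNodesDown
      expr: cassandra_nodes_down > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "One or more Cassandra nodes are DOWN"
    - alert: CassandraRepairFailed
      expr: cassandra_repair_status == 0
      labels:
        severity: warning
      annotations:
        summary: "The last Cassandra repair operation failed"
EOF
kubectl apply -f cassandra-monitor-rules.yaml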
For example, a Prometheus graph of the cassandra_nodes_up metric shows at a glance how many nodes are active over time. You can create similar dashboards to monitor the other key metrics and gain insights into your Cassandra cluster's health and replication status.
By following these steps, you can effectively monitor Cassandra replication in your Apigee hybrid environment using Prometheus. This setup provides valuable insights into Cassandra's health, allows for proactive issue detection, and helps ensure the reliable operation of your API platform. Remember to adapt the configurations and scripts to match the specifics of your Apigee hybrid deployment and Kubernetes environment.
Thanks a lot to my friends and colleagues @omidt and @ncardace for their feedback on drafts of this article!