Monitoring Cassandra nodes status in Apigee Hybrid with Prometheus

Introduction

Apigee hybrid relies on Cassandra for storing critical runtime data. Ensuring the health and proper status of Cassandra nodes is therefore paramount for the stability and performance of your API platform. This article provides a comprehensive guide to setting up Cassandra monitoring in an Apigee hybrid environment using Prometheus, based on a practical Kubernetes-based approach.

Purpose

This guide details the process of monitoring Cassandra nodes status within Apigee hybrid deployments and exporting the collected metrics to Prometheus for visualization and alerting.

This solution serves as a reference implementation that can be used as an example to create custom metrics based on hybrid runtime pods and visualize them with Prometheus/Grafana.

You can find the material for this article in the following git repository.

Apigee Hybrid's Built-in Metrics Collection

Apigee hybrid collects operational metrics that you can use to monitor the health of hybrid runtime plane services and API traffic. 

Apigee employs OpenTelemetry for metrics collection within its architecture. Apigee hybrid sends the collected metrics data to Google Cloud Observability, where you can use the Cloud Monitoring console to view, search, and analyze metrics and to manage alerts.

[intro.png: Apigee hybrid built-in metrics collection pipeline]

This built-in metrics collection offers a foundational layer for monitoring the Apigee hybrid platform itself; the subsequent sections detail how to augment it with custom, Cassandra-specific metrics.

For an exhaustive list of Cassandra metrics related to Apigee hybrid, you can read the Apigee hybrid documentation: Viewing metrics/Cassandra metrics.

Overview

The monitoring setup is implemented directly on the Kubernetes cluster where the Apigee hybrid runtime is deployed. It involves deploying a custom script, configuring RBAC, managing persistent storage, and setting up Prometheus scraping.

Prerequisites

  • Apigee hybrid runtime environment (e.g., version 1.14) installed on a functional Kubernetes cluster (e.g., GKE)
  • kubectl command-line tool configured for cluster access and interaction
  • Access to Docker Hub or a private container registry
  • helm package manager installed for Prometheus deployment
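To quickly confirm these prerequisites, you can run the following commands (a sketch; the apigee namespace assumes a default hybrid installation):

kubectl version --client
helm version
kubectl get pods -n apigee   # Apigee runtime pods, including Cassandra, should be Running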

Cassandra Monitoring Setup with Prometheus on Kubernetes

This section outlines a method to monitor Cassandra replication status within Apigee hybrid deployments by deploying a custom monitoring solution on the Kubernetes cluster. 

The collected metrics are then exported to Prometheus for enhanced visualization and alerting capabilities.

1. Deploy a Script to Check Cassandra Status and Export Metrics

The first step involves deploying a script to check Cassandra status and export metrics. The provided Bash script, check_cassandra_replication.sh, serves this purpose.

This script connects to Cassandra using nodetool within an Apigee Cassandra pod. It then retrieves Cassandra node status, determining whether nodes are UP or DOWN.

The script calculates the number of UP and DOWN nodes, and it has the option to perform Cassandra repairs if necessary, utilizing the nodetool repair command. Finally, the script exports the collected metrics in Prometheus format.

The script's functionality relies on several key elements. 

Environment variables such as NAMESPACE, POD_NAME, JMX_USER, and JMX_PASSWORD configure the script's operation. A more secure approach is to use Kubernetes Secrets instead of plain environment variables to set JMX_USER and JMX_PASSWORD, as sketched below.
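For illustration, here is a minimal sketch of the Secret-based approach; the Secret name cassandra-monitor-jmx and its key names are hypothetical and must match whatever you reference later in the CronJob manifest:

cat << 'EOF' > cassandra-monitor-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: cassandra-monitor-jmx   # hypothetical name
  namespace: apigee
type: Opaque
stringData:
  jmx-user: <jmx_user>
  jmx-password: <jmx_password>
EOF

kubectl apply -f cassandra-monitor-secret.yaml

The CronJob container would then read the credentials with secretKeyRef entries instead of literal values:

env:
- name: APIGEE_JMX_USER
  valueFrom:
    secretKeyRef:
      name: cassandra-monitor-jmx
      key: jmx-user
- name: APIGEE_JMX_PASSWORD
  valueFrom:
    secretKeyRef:
      name: cassandra-monitor-jmx
      key: jmx-password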

Temporary files are used to store the output of nodetool commands. Command-line arguments, specifically --repair and --prometheus, control the script's behavior. The script analyzes the output of the nodetool status command to determine the health of Cassandra nodes. 

Metrics are exported in the Prometheus exposition format, which includes metrics such as cassandra_nodes_total (total number of Cassandra nodes), cassandra_nodes_up (number of active nodes), cassandra_nodes_down (number of offline nodes), and cassandra_repair_status (status of the last repair operation).

Here is an example of the content that is created (the values shown assume a healthy three-node cluster):

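# HELP cassandra_nodes_total Total number of Cassandra nodes in the cluster
# TYPE cassandra_nodes_total gauge
cassandra_nodes_total 3
# HELP cassandra_nodes_up Number of Cassandra nodes currently UP
# TYPE cassandra_nodes_up gauge
cassandra_nodes_up 3
# HELP cassandra_nodes_down Number of Cassandra nodes currently DOWN
# TYPE cassandra_nodes_down gauge
cassandra_nodes_down 0
# HELP cassandra_repair_status Status of the last repair operation (1=success, 0=failure, -1=not run)
# TYPE cassandra_repair_status gauge
cassandra_repair_status -1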

Here is the shell script:

#!/bin/bash

# Variables (adjust if they are not set in the environment)
NAMESPACE="apigee"
POD_NAME="apigee-cassandra-default-0"
JMX_USER="${APIGEE_JMX_USER:-ddl_user}"  # Default JMX user; override via APIGEE_JMX_USER if needed
JMX_PASSWORD="${APIGEE_JMX_PASSWORD:-iloveapis123}"  # Default JMX password; override via APIGEE_JMX_PASSWORD if needed

# Temporary files to store outputs
TEMP_STATUS_FILE="/tmp/nodetool_status.txt"
TEMP_REPAIR_FILE="/tmp/nodetool_repair.txt"
PROMETHEUS_FILE="${PROMETHEUS_FILE:-/metrics-data/cassandra_replication.prom}"  # Configurable via env var

# Options for forcing repair or exporting to Prometheus
FORCE_REPAIR=false
EXPORT_PROMETHEUS=false
for arg in "$@"; do
  if [ "$arg" == "--repair" ]; then
    FORCE_REPAIR=true
  elif [ "$arg" == "--prometheus" ]; then
    EXPORT_PROMETHEUS=true
  fi
done

# Execute nodetool status
echo "Retrieving Cassandra node status..."
kubectl exec $POD_NAME -n $NAMESPACE -- nodetool -u "$JMX_USER" -pw "$JMX_PASSWORD" status > "$TEMP_STATUS_FILE" 2>/tmp/nodetool_error.log

# Check if the command succeeded
if [ $? -ne 0 ]; then
  echo "Error executing nodetool status:"
  cat /tmp/nodetool_error.log
  exit 1
fi

# Analyze the output to count UP and DOWN nodes
TOTAL_NODES=$(grep -E '^[UD][NLJM]' "$TEMP_STATUS_FILE" | wc -l)
UP_NODES=$(grep -E '^UN' "$TEMP_STATUS_FILE" | wc -l)
DOWN_NODES=$(grep -E '^DN' "$TEMP_STATUS_FILE" | wc -l)

# Display the results
echo "Supervision results:"
echo "Total number of nodes: $TOTAL_NODES"
echo "UP nodes (active): $UP_NODES"
echo "DOWN nodes (offline): $DOWN_NODES"

# Initialize repair status (-1 means not run)
REPAIR_STATUS=-1

# Check the status and decide if repair is needed
REPAIR_NEEDED=false
if [ "$DOWN_NODES" -gt 0 ]; then
  echo " ALERT: $DOWN_NODES node(s) are DOWN. Replication may be compromised."
  echo "Details of DOWN nodes:"
  grep -E '^DN' "$TEMP_STATUS_FILE"
  REPAIR_NEEDED=true
elif [ "$FORCE_REPAIR" = true ]; then
  echo " Forced repair requested via --repair."
  REPAIR_NEEDED=true
else
  echo " All nodes are UP. Replication appears to be effective."
fi

# Execute nodetool repair if necessary
if [ "$REPAIR_NEEDED" = true ]; then
  echo "Running nodetool repair to synchronize data..."
  kubectl exec $POD_NAME -n $NAMESPACE -- nodetool -u "$JMX_USER" -pw "$JMX_PASSWORD" repair -full > "$TEMP_REPAIR_FILE" 2>>/tmp/nodetool_error.log

  # Check if the repair succeeded
  if [ $? -eq 0 ]; then
    echo "Repair completed successfully. Check details in $TEMP_REPAIR_FILE if needed."
    REPAIR_STATUS=1
  else
    echo "Repair failed:"
    cat /tmp/nodetool_error.log
    REPAIR_STATUS=0
    # Do not exit here: fall through so the failure status (0) is still exported below
  fi
fi

# Export to Prometheus if requested
if [ "$EXPORT_PROMETHEUS" = true ]; then
  echo "Exporting metrics to Prometheus file: $PROMETHEUS_FILE"
  mkdir -p "$(dirname "$PROMETHEUS_FILE")"  # Ensure the directory exists (should be writable via volume)
  cat <<EOF > "$PROMETHEUS_FILE"
# HELP cassandra_nodes_total Total number of Cassandra nodes in the cluster
# TYPE cassandra_nodes_total gauge
cassandra_nodes_total $TOTAL_NODES
# HELP cassandra_nodes_up Number of Cassandra nodes currently UP
# TYPE cassandra_nodes_up gauge
cassandra_nodes_up $UP_NODES
# HELP cassandra_nodes_down Number of Cassandra nodes currently DOWN
# TYPE cassandra_nodes_down gauge
cassandra_nodes_down $DOWN_NODES
# HELP cassandra_repair_status Status of the last repair operation (1=success, 0=failure, -1=not run)
# TYPE cassandra_repair_status gauge
cassandra_repair_status $REPAIR_STATUS
EOF
  if [ $? -eq 0 ]; then
    echo "Metrics successfully exported to $PROMETHEUS_FILE"
  else
    echo "Failed to export metrics to $PROMETHEUS_FILE"
    exit 1
  fi
fi

# Clean up temporary files
rm -f "$TEMP_STATUS_FILE" "$TEMP_REPAIR_FILE" /tmp/nodetool_error.log
# Signal the overall result and terminate the pod
if [ "$REPAIR_STATUS" -eq 0 ]; then
  echo "Repair failed; metrics were exported with failure status."
  exit 1
fi
echo "Metrics written successfully, terminating pod..."
exit 0
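Before containerizing the script, you can test it locally against the cluster (a sketch, assuming your kubectl context points at the hybrid cluster and the defaults above match your environment):

chmod +x check_cassandra_replication.sh
PROMETHEUS_FILE=/tmp/cassandra_replication.prom ./check_cassandra_replication.sh --prometheus
cat /tmp/cassandra_replication.prom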

Accompanying this script is a Dockerfile, which containerizes the script for deployment. The image is based on Ubuntu 20.04; it installs kubectl and other necessary tools, copies the check_cassandra_replication.sh script into the container, and sets the script as the container's entrypoint.

Dockerfile

FROM ubuntu:20.04

# Install kubectl and basic tools
RUN apt-get update && apt-get install -y curl bash && \
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && \
    chmod +x kubectl && mv kubectl /usr/local/bin/

# Copy the script
COPY check_cassandra_replication.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/check_cassandra_replication.sh

# Set entrypoint
ENTRYPOINT ["/usr/local/bin/check_cassandra_replication.sh"]

2. Build and Push the Docker Image

This step involves building and pushing the Docker image. The Docker image is constructed using the Dockerfile. It is then tagged and pushed to a container registry, which could be Docker Hub or a private container registry. 

For example, when using Google Cloud Artifact Registry, the following commands can be used: 

docker build -t <prefix>/cassandra-monitor:latest .
docker tag <prefix>/cassandra-monitor:latest <region>-docker.pkg.dev/<project_id>/images-repo/<prefix>/cassandra-monitor:latest
docker push <region>-docker.pkg.dev/<project_id>/images-repo/<prefix>/cassandra-monitor:latest

Notes:

  • <prefix> must be replaced by a string, such as your first or last name
  • <project_id> is the Google Cloud project identifier where the Artifact Registry repository will be created
  • <region> is the Google Cloud region where the Artifact Registry repository is set
  • images-repo is the name of the Artifact Registry repository. You can change it to a name of your choosing (see the commands below)
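If the repository does not exist yet, you can create it and configure Docker authentication first; this is a sketch, assuming the placeholder values above:

gcloud artifacts repositories create images-repo \
  --repository-format=docker --location=<region> --project=<project_id>
gcloud auth configure-docker <region>-docker.pkg.dev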

3. Set Up RBAC (Role-Based Access Control)

This step presents how to set up Role-Based Access Control (RBAC). RBAC is configured to grant the monitoring pod the necessary permissions to interact with the Kubernetes API and execute commands within the Apigee Cassandra pods. 

A ServiceAccount named cassandra-monitor-sa is created in the apigee namespace. 

cat << 'EOF' > cassandra-monitor-sa.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cassandra-monitor-sa
  namespace: apigee
EOF

kubectl apply -f cassandra-monitor-sa.yaml

Subsequently, a Role named cassandra-monitor-role is defined in the apigee namespace.

cat << 'EOF' > cassandra-monitor-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cassandra-monitor-role
  namespace: apigee
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
EOF

kubectl apply -f cassandra-monitor-role.yaml

This role grants permissions to get and list pods, as well as create pods/exec, which is required to execute commands within pods.

Finally, a RoleBinding named cassandra-monitor-rolebinding is created in the apigee namespace. 

cat << 'EOF' > cassandra-monitor-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cassandra-monitor-rolebinding
  namespace: apigee
subjects:
- kind: ServiceAccount
  name: cassandra-monitor-sa
  namespace: apigee
roleRef:
  kind: Role
  name: cassandra-monitor-role
  apiGroup: rbac.authorization.k8s.io
EOF

kubectl apply -f cassandra-monitor-rolebinding.yaml

This binds the cassandra-monitor-sa ServiceAccount to the cassandra-monitor-role, thus granting the defined permissions.
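As a quick sanity check (not part of the original setup), you can confirm the granted permissions with kubectl's built-in authorization test; both commands should print yes:

kubectl auth can-i get pods --as=system:serviceaccount:apigee:cassandra-monitor-sa -n apigee
kubectl auth can-i create pods --subresource=exec --as=system:serviceaccount:apigee:cassandra-monitor-sa -n apigee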

4. Set Up Persistent Storage

This step involves setting up persistent storage. Persistent storage is necessary to store the exported Prometheus metrics. 

A PersistentVolumeClaim (PVC) named cassandra-metrics-pvc is created in the apigee namespace. 

This PVC requests 1Gi of storage with the ReadWriteOnce access mode. The storageClassName is set to standard; adjust it as needed for your specific environment. It is crucial to verify that the PVC is successfully bound to a PersistentVolume (see the check after the manifest).

cat << 'EOF' > cassandra-metrics-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cassandra-metrics-pvc
  namespace: apigee
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard
EOF

kubectl apply -f cassandra-metrics-pvc.yaml
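To verify the binding, inspect the PVC; note that depending on the storage class's volumeBindingMode, the PVC may remain Pending until the first pod mounts it:

kubectl get pvc cassandra-metrics-pvc -n apigee   # STATUS should eventually show Bound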

5. Deploy the CronJob

A CronJob is deployed to schedule the execution of the Cassandra monitoring script. The cassandra-monitor-cronjob.yaml file defines this CronJob. It specifies the schedule for running the job. 

The job uses the Docker image built in the second step. It mounts the PersistentVolumeClaim created in the fourth step to store the metrics. It also uses the ServiceAccount created in the third step for RBAC permissions. 

The CronJob will then create pods according to the defined schedule, and each pod will execute the monitoring script.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cassandra-monitor
  namespace: apigee
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 1  # Keep only the latest successful job
  failedJobsHistoryLimit: 1      # Keep only the latest failed job
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: cassandra-monitor
        spec:
          serviceAccountName: cassandra-monitor-sa
          containers:
          - name: cassandra-monitor
            image: <region>-docker.pkg.dev/<project_id>/images-repo/<prefix>/cassandra-monitor:latest
            args: ["--prometheus"]
            env:
            - name: APIGEE_JMX_USER
              value: "<jmx_user>"
            - name: APIGEE_JMX_PASSWORD
              value: "<jmx_password>"
            - name: PROMETHEUS_FILE
              value: "/metrics-data/cassandra_replication.prom"  # Explicitly set the path
            volumeMounts:
            - name: metrics-volume
              mountPath: /metrics-data
          - name: node-exporter
            image: prom/node-exporter:latest
            args:
            - "--collector.textfile.directory=/metrics-data"
            - "--web.listen-address=:9100"
            ports:
            - containerPort: 9100
              name: metrics
            volumeMounts:
            - name: metrics-volume
              mountPath: /metrics-data
          restartPolicy: OnFailure
          volumes:
          - name: metrics-volume
            persistentVolumeClaim:
              claimName: cassandra-metrics-pvc

kubectl apply -f cassandra-monitor-cronjob.yaml

Note:

  • schedule: "*/5 * * * *": this is standard cron schedule syntax; in this case, the CronJob executes every 5 minutes
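To test the setup without waiting for the schedule, you can trigger a one-off run from the CronJob (the job name cassandra-monitor-manual is arbitrary):

kubectl create job --from=cronjob/cassandra-monitor cassandra-monitor-manual -n apigee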

6. Verify the Pod

After the CronJob runs, it's important to verify that the monitoring pod is created and functioning correctly. Verification steps include using kubectl get pods -n apigee to check the pod status. 

You should expect the pod to show 1/2 in the READY column, which indicates that the node-exporter sidecar container is still running while the cassandra-monitor container has completed.

Additionally, you can use kubectl logs -n apigee <cassandra-monitor-pod-name> -c cassandra-monitor to inspect the logs of the cassandra-monitor container. 

The logs should show successful execution of the script and the export of metrics (in this example, a cluster of 3 Cassandra nodes, all UP).

Retrieving Cassandra node status...
Supervision results:
Total number of nodes: 3
UP nodes (active): 3
DOWN nodes (offline): 0
All nodes are UP. Replication appears to be effective.
Exporting metrics to Prometheus file: /metrics-data/cassandra_replication.prom
Metrics successfully exported to /metrics-data/cassandra_replication.prom
Metrics written successfully, terminating pod...
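You can also check what the node-exporter sidecar actually serves by port-forwarding to its metrics port (a sketch; <cassandra-monitor-pod-name> is the pod observed above):

kubectl port-forward -n apigee <cassandra-monitor-pod-name> 9100:9100
curl -s http://localhost:9100/metrics | grep cassandra_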

7. Set Up Prometheus Scraping

This step involves setting up Prometheus scraping. 

To enable Prometheus to collect the Cassandra metrics, it is necessary to define a Service and a ServiceMonitor. 

A Service named cassandra-monitor-service is created in the apigee namespace. 

cat << 'EOF' > cassandra-monitor-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: cassandra-monitor-service
  namespace: apigee
  labels:
    app: cassandra-monitor
spec:
  selector:
    app: cassandra-monitor
  ports:
  - name: metrics
    port: 9100
    targetPort: 9100
EOF

kubectl apply -f cassandra-monitor-service.yaml

This service selects pods with the label app: cassandra-monitor. It exposes port 9100, which is the port where the metrics are exposed by the node-exporter sidecar. 

If Prometheus is not already installed in your cluster, you can use helm to install the kube-prometheus-stack. This involves adding the prometheus-community Helm repository, updating the repository, creating a monitoring namespace, and then installing the chart. 

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring

helm list -n monitoring

A ServiceMonitor named cassandra-monitor is created in the monitoring namespace. This ServiceMonitor instructs Prometheus on how to discover and scrape metrics from the cassandra-monitor-service.

It uses a selector to match the service's labels. It specifies the port, the scraping interval, and the scheme. It also uses a namespaceSelector to target the apigee namespace where the service is located.

cat << 'EOF' > cassandra-monitor-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cassandra-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: cassandra-monitor
  endpoints:
  - port: metrics
    interval: 30s
    scheme: http
  namespaceSelector:
    matchNames:
    - apigee
EOF

kubectl apply -f cassandra-monitor-servicemonitor.yaml

Verifications

Verification procedures include several steps. 

  1. For Cassandra monitor pods, the command kubectl get pods -n apigee -l app=cassandra-monitor is used to verify that the cassandra-monitor pods are in the expected state. 
  2. For the Cassandra monitor service, the command kubectl get svc -n apigee is used to check that the cassandra-monitor-service exists. 
  3. For Cassandra monitor endpoints, the command kubectl get endpoints cassandra-monitor-service -n apigee is used to verify that the endpoints for the service are correct and point to the monitoring pods. 
  4. For the Cassandra monitor service monitor, the command kubectl get servicemonitor -n monitoring is used to confirm that the cassandra-monitor ServiceMonitor is created.

Prometheus Web UI access can be achieved via port forwarding. 

The command kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 forwards the Prometheus UI port to your local machine. 

You can then open a web browser to http://localhost:9090/ to access the Prometheus web interface. Within the Prometheus UI, PromQL (Prometheus Query Language) can be used to query the exported Cassandra metrics, such as cassandra_nodes_up, cassandra_nodes_down, cassandra_nodes_total, and cassandra_repair_status.

[doc.png: Prometheus web UI querying the exported Cassandra metrics]

Creating graphs and dashboards in Prometheus allows for the visualization of these metrics over time, enabling effective monitoring and alerting.

The screenshot above shows an example Prometheus graph displaying the cassandra_nodes_up metric. You can create similar dashboards to monitor other key metrics and gain insights into your Cassandra cluster's health and replication status.
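For alerting, you can also define a PrometheusRule that the kube-prometheus-stack picks up. This is a sketch: the alert name, severity, and 10m duration are illustrative choices, and the release: prometheus label must match your Helm release name:

cat << 'EOF' > cassandra-monitor-alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cassandra-monitor-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: cassandra.rules
    rules:
    - alert: CassandraNodesDown
      expr: cassandra_nodes_down > 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "{{ $value }} Cassandra node(s) reported DOWN"
EOF

kubectl apply -f cassandra-monitor-alert.yaml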

Conclusion

By following these steps, you can effectively monitor Cassandra replication in your Apigee hybrid environment using Prometheus. This setup provides valuable insights into Cassandra's health, allows for proactive issue detection, and helps ensure the reliable operation of your API platform. Remember to adapt the configurations and scripts to match the specifics of your Apigee hybrid deployment and Kubernetes environment.

Thanks a lot to my friends and colleagues @omidt and @ncardace for their feedback on drafts of this article!
