
What is the secret sauce to GKE Cost Optimization and GKE Autoscaling (Standard GKE)?

We deployed a Dependency-Track backend on a Standard GKE cluster with 7 nodes, totalling 56 vCPU and 224 GB RAM. The core component is the API server, which has a minimum container requirement of 2 vCPU and 4.5 GB RAM. The system is intended to be multi-tenant (isolation via Kubernetes namespaces), so there is one instance of the API server per namespace. The issues we are seeing are below.

1. Cluster is highly underutilized: According to the GKE Cost Optimization dashboard (which seems misleading), the cluster is highly underutilized. When users are active on the system we definitely see CPU and memory usage increase, but that still does not account for the reported underutilization. How do we fix this?

2. Autoscaling issues: We are running with pretty much the default settings for the Cluster Autoscaler and node autoscaling. The DT API server (a Deployment, not a StatefulSet) uses a single PersistentVolume, but that's about it.

Attaching all the screenshots from the GKE cluster: the resource limits YAML, the CPU and memory utilization views from the Cost Optimization dashboard, and the autoscaling issue.

 

 

 


Hi,

IMHO, the secret sauce is a combination of proactive resource management, smart scaling strategies, and ongoing monitoring and adjustment. Here is a course of action.

(1) Adjusting Pod Requests and Limits

You would need to update your Kubernetes deployment YAML to adjust the requests and limits:

 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3 # Adjust this number based on your scaling needs
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api-server
        image: dependencytrack/apiserver
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "3" # Adjusted from 4 to 3 based on observed usage
            memory: "8Gi" # Adjusted from 16Gi to 8Gi
          requests:
            cpu: "1" # Adjusted from 2 to 1
            memory: "4Gi" # Adjusted from 5Gi to 4Gi
        volumeMounts:
        - mountPath: /data
          name: dependency-track
      volumes:
      - name: dependency-track
        persistentVolumeClaim:
          claimName: <your-pvc-name>
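
To figure out what the right requests actually are, GKE's Vertical Pod Autoscaler can be run in recommendation-only mode first. A minimal sketch, assuming vertical Pod autoscaling is enabled on the cluster and the Deployment is named api-server as above (the VPA object name is arbitrary):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off" # recommendation only, no automatic Pod restarts

The suggested requests should then show up under kubectl describe vpa api-server-vpa once the workload has run for a while.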

 

(2) Implement Horizontal Pod Autoscaler (HPA)

Set up an HPA to scale based on CPU and memory usage. For CPU-only scaling, the kubectl shortcut is:

 

kubectl autoscale deployment api-server --cpu-percent=50 --min=1 --max=10
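
The kubectl shortcut above only covers CPU. To also scale on memory, an autoscaling/v2 manifest along these lines would be needed (the 50% / 70% utilization targets are placeholders to tune against observed usage):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50 # placeholder target
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70 # placeholder target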

 

(3) Using Affinity and Anti-Affinity

To set up pod affinity and anti-affinity, you would modify your deployment configuration:

 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - api-server
            topologyKey: "kubernetes.io/hostname"
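
One caveat: a required anti-affinity rule can leave replicas unschedulable and prevent the autoscaler from packing them onto fewer nodes once the replica count approaches the node count. If that becomes a problem, a softer preferred rule is an option; a sketch:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api-server
              topologyKey: "kubernetes.io/hostname"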

 

(4) Setting Budgets and Cost Allocation Tags

To create budgets and set up cost alerts, you would use the GCP console or gcloud CLI, not Kubernetes configuration files. However, to use labels for cost allocation:

 

kubectl label pods <pod-name> team=finance
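
A label applied by hand to a running Pod is lost when the Pod is recreated, so for cost allocation it is more durable to set it in the Deployment's Pod template (team=finance is just an illustrative key/value, and as far as I know GKE cost allocation also has to be enabled on the cluster for label breakdowns to show up in billing):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    metadata:
      labels:
        app: api-server
        team: finance # example cost-allocation label
    # ... rest of the Pod spec as in the manifest above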

 

(5) Configure Cluster Autoscaler

Ensure the cluster autoscaler is set up correctly for your node pools. Here is an example in Terraform (IaC):

 

resource "google_container_cluster" "primary" {
  # ... other cluster specs
  node_pool {
    # ... other node pool specs
    autoscaling {
      min_node_count = 1
      max_node_count = 10
    }
  }
}

 

 

(6) Optimize Node Pool Management

Create multiple node pools for different workload needs.

 

 

resource "google_container_node_pool" "secondary" {
  # ... other node pool specs
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  management {
    auto_repair  = true
    auto_upgrade = true
  }
  node_config {
    # Specify a different machine type, disk size, etc.
  }
}
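
To actually steer a workload onto a specific pool, the built-in GKE node pool label can be used as a nodeSelector in the Pod template (the pool name secondary matches the example above; substitute your real pool name):

spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: secondary # GKE sets this label on every node with its pool name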

 

 

(7) Monitor Autoscaling Events

Review the autoscaling events to understand the scaling behavior.

 

 

gcloud logging read "resource.type=\"k8s_cluster\" AND jsonPayload.message: \"ClusterAutoscaler\"" --limit 10 --format "table(timestamp, jsonPayload.message)"

 

 

I hope it helps.

Best Regards

Mahmoud

Thanks @mahmoudrabie. On (1): right-sizing the workloads, i.e. understanding the actual resource usage, is the biggest challenge and not easy.

100% agree, and this is the common pain: the absence of adequate performance analysis and the lack of proactive monitoring.
