custom compute class stops scaling of the entire cluster

sahil-R · 11-13-2024 04:19 AM

I was trying to introduce custom compute classes in production. My config uses workload-based selection of the compute class via node selectors. When I added my first compute class for one of our deployments, it worked fine and autoscaling worked perfectly. When I deployed a second custom compute class and redeployed another workload, that workload shifted to the custom compute class, but after 4-5 hours autoscaling stopped for the entire cluster, including workloads that were not using a custom compute class.

Scrubbed custom compute class config I am using:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: temp
spec:
  priorities:
  - nodepools: [temp-prod-od]  # autoscaling 0-3, node type: n2d-highcpu-8
  - nodepools: [temp-prod-spot-2, temp-prod-spot]  # autoscaling 0-30, node types: n2d-highcpu-8, n2-highcpu-8
  - nodepools: [temp-prod-od-backup-2, temp-prod-od-backup-1]  # autoscaling 0-30, node types: n2d-highcpu-8, n2-highcpu-8
  activeMigration:
    optimizeRulePriority: true

cloudgeek7

What is the error message you are receiving?

sahil-R

Hi, the problem is that I don't see any error messages. The pods are simply unschedulable, both the ones using a custom compute class and the ones without.

francislouie

Hi @sahil-R,

Welcome to Google Cloud Community!

Based on the thread, you were encountering unschedulable pods. According to this documentation, a Pod becomes unschedulable when the Kubernetes scheduler cannot place it on any existing node due to insufficient resources, node constraints, or unmet Pod requirements.
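To see what the scheduler itself reports for each pending Pod, it can help to read the PodScheduled condition on the Pod status. Below is a minimal Python sketch, assuming you have the Pod list as JSON (for example from `kubectl get pods --all-namespaces -o json`). The function name `unschedulable_pods` and the sample data are made up for illustration and are not taken from your cluster.

```python
import json

def unschedulable_pods(pod_list):
    """Return (namespace, name, message) for each Pod the scheduler could not place."""
    results = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cond in pod.get("status", {}).get("conditions", []):
            # The scheduler sets PodScheduled=False with reason Unschedulable
            # when no node fits the Pod.
            if (cond.get("type") == "PodScheduled"
                    and cond.get("status") == "False"
                    and cond.get("reason") == "Unschedulable"):
                results.append(
                    (meta.get("namespace"), meta.get("name"), cond.get("message", ""))
                )
    return results

# Hypothetical sample input, shaped like a Kubernetes Pod list.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"namespace": "default", "name": "web-1"},
      "status": {"conditions": [
        {"type": "PodScheduled", "status": "False", "reason": "Unschedulable",
         "message": "0/3 nodes are available: 3 Insufficient cpu."}
      ]}
    },
    {
      "metadata": {"namespace": "default", "name": "web-2"},
      "status": {"conditions": [
        {"type": "PodScheduled", "status": "True"}
      ]}
    }
  ]
}
""")

for namespace, name, message in unschedulable_pods(sample):
    print(f"{namespace}/{name}: {message}")
```

The message on that condition usually states which constraint failed (resources, node selectors, taints), which is a useful starting point when no other error is visible.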

Insufficient resources in a zone can prevent the autoscaler from creating new node pools. To diagnose the problem, run this query in Logs Explorer in the Google Cloud console, replacing CLUSTER_NAME with the name of your cluster, and look for these error messages:

log_id("cloudaudit.googleapis.com/activity")
resource.labels.cluster_name="CLUSTER_NAME"
protoPayload.status.message:("ZONE_RESOURCE_POOL_EXHAUSTED" OR "does not have enough resources available to fulfill the request" OR "resource pool exhausted" OR "does not exist in zone")
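If you need to check more than one cluster, the filter can be assembled programmatically so the message list stays in one place. This is only a sketch: the helper name `build_filter` and the cluster name `my-cluster` are hypothetical, and the output is meant to be pasted into the query box.

```python
# Error messages that indicate a zonal resource shortage (from the query above).
STOCKOUT_MESSAGES = [
    "ZONE_RESOURCE_POOL_EXHAUSTED",
    "does not have enough resources available to fulfill the request",
    "resource pool exhausted",
    "does not exist in zone",
]

def build_filter(cluster_name):
    """Build the audit-log filter text for one cluster."""
    messages = " OR ".join(f'"{m}"' for m in STOCKOUT_MESSAGES)
    return "\n".join([
        'log_id("cloudaudit.googleapis.com/activity")',
        f'resource.labels.cluster_name="{cluster_name}"',
        f"protoPayload.status.message:({messages})",
    ])

# "my-cluster" is a placeholder; substitute your own cluster name.
print(build_filter("my-cluster"))
```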

As a recommendation, you can try removing the ComputeClass resource and then monitor whether autoscaling starts working again.

For further troubleshooting, you may refer to these documents:

I hope the above information is helpful.
