
Custom compute class stops scaling of the entire cluster

I was trying to introduce custom compute classes in production. My config uses workload-based selection of the compute class via node selectors. When I added my first compute class for one of our deployments, it worked fine and autoscaling worked perfectly. When I deployed the second custom compute class and redeployed another workload, that workload shifted to the custom compute class, but after 4-5 hours autoscaling stopped for the entire cluster, including workloads that were not using a custom compute class.

Scrubbed custom compute class config I am using:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: temp
spec:
  priorities:
  - nodepools: [temp-prod-od]  # autoscaling 0-3, node type: n2d-highcpu-8
  - nodepools: [temp-prod-spot-2, temp-prod-spot]  # autoscaling 0-30, node type: n2d-highcpu-8, n2-highcpu-8
  - nodepools: [temp-prod-od-backup-2, temp-prod-od-backup-1]  # autoscaling 0-30, node type: n2d-highcpu-8, n2-highcpu-8
  activeMigration:
    optimizeRulePriority: true

What is the error message you are receiving?

Hi, the problem is that I don't see any error messages, just that the pods are unschedulable, both the ones using a custom compute class and the ones without.

Hi @sahil-R,

Welcome to Google Cloud Community!

Based on the thread, you were encountering unschedulable pods. According to this documentation, a Pod becomes unschedulable when the Kubernetes scheduler cannot place it on any existing node due to insufficient resources, node constraints, or unmet Pod requirements. 
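Even when no explicit error is surfaced, the scheduler normally records why it could not place a Pod in that Pod's `PodScheduled` condition. The following is a minimal Python sketch, not an official tool, that pulls those messages out of the JSON produced by `kubectl get pods --all-namespaces -o json`. The helper name `unschedulable_pods`, the sample Pod, and the sample message text are hypothetical and only illustrate the shape of the data.

```python
import json


def unschedulable_pods(pod_list):
    """Return (namespace/name, message) pairs for Pods the scheduler could not place.

    pod_list is the parsed JSON of a Kubernetes PodList, for example the
    output of `kubectl get pods --all-namespaces -o json`.
    """
    results = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cond in pod.get("status", {}).get("conditions", []):
            if (cond.get("type") == "PodScheduled"
                    and cond.get("status") == "False"
                    and cond.get("reason") == "Unschedulable"):
                name = f'{meta.get("namespace", "default")}/{meta.get("name", "?")}'
                results.append((name, cond.get("message", "")))
    return results


# Hypothetical sample data: one Pending Pod and one running Pod.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"namespace": "prod", "name": "api-1"},
      "status": {
        "phase": "Pending",
        "conditions": [
          {"type": "PodScheduled", "status": "False",
           "reason": "Unschedulable",
           "message": "0/3 nodes are available: 3 Insufficient cpu."}
        ]
      }
    },
    {
      "metadata": {"namespace": "prod", "name": "api-2"},
      "status": {
        "phase": "Running",
        "conditions": [{"type": "PodScheduled", "status": "True"}]
      }
    }
  ]
}
""")

for name, message in unschedulable_pods(sample):
    print(name, "->", message)
```

To run it against a real cluster, one option is to replace `sample` with `json.load(sys.stdin)` (after `import sys`) and pipe the `kubectl` output into the script.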

Insufficient resources can prevent autoscaling from creating new node pools. To diagnose the problem, run this query in the Google Cloud console and look for these error messages:

log_id("cloudaudit.googleapis.com/activity")
resource.labels.cluster_name="CLUSTER_NAME"
protoPayload.status.message:("ZONE_RESOURCE_POOL_EXHAUSTED" OR "does not have enough resources available to fulfill the request" OR "resource pool exhausted" OR "does not exist in zone")
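If you need to run this query for several clusters, it can help to generate the filter text rather than edit it by hand. Below is a minimal Python sketch that builds the query above for a given cluster name. The function name `capacity_error_filter` and the cluster name `my-cluster` are placeholders, not anything from the original post.

```python
def capacity_error_filter(cluster_name):
    """Build a Cloud Logging filter for capacity-related errors in one cluster."""
    # Message fragments that indicate the zone could not supply capacity.
    messages = [
        "ZONE_RESOURCE_POOL_EXHAUSTED",
        "does not have enough resources available to fulfill the request",
        "resource pool exhausted",
        "does not exist in zone",
    ]
    message_clause = " OR ".join(f'"{m}"' for m in messages)
    # Separate lines in a Logging filter are implicitly combined with AND.
    return "\n".join([
        'log_id("cloudaudit.googleapis.com/activity")',
        f'resource.labels.cluster_name="{cluster_name}"',
        f"protoPayload.status.message:({message_clause})",
    ])


# "my-cluster" is a placeholder cluster name.
print(capacity_error_filter("my-cluster"))
```

The printed text can be pasted into Logs Explorer, or passed as the filter argument to `gcloud logging read` if you prefer the command line.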

As a recommendation, you can try removing the ComputeClass custom resource and then monitor whether autoscaling starts working again.

For further troubleshooting, you may refer to the following documentation:

I hope the above information is helpful.
