I have a Standard GKE cluster provisioned with a node pool of e2 medium 16 nodes that are set to autoscale. I have workloads with the following resource constraints:
CPU request: 200m
CPU limit: 1000m
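For reference, one quick way to confirm what requests and limits the API server actually holds for a given pod (the pod name below is a placeholder) is:

    # Print the resources stanza of each container in the pod
    kubectl get pod MY-POD -o jsonpath='{.spec.containers[*].resources}'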
However, when the workloads are deployed, I am seeing the following error occur on a large number of the pods:
"Pod was rejected: Node didn't have enough resource: cpu, requested: 250, used: 15739, capacity: 15890"
The error seems to indicate that the pod requested 250m even though the resource request is 200m. That is, the scheduler appears to place pods based on the declared resource request, but something then allocates more than that once the pod reaches the node, causing pods to fail to deploy.
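One way to compare a node's allocatable CPU with what has already been requested on it, i.e. the same numbers the kubelet is reporting, is something like this (the node name is a placeholder):

    # Allocatable CPU per node
    kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU_ALLOCATABLE:.status.allocatable.cpu
    # Breakdown of CPU requests/limits already placed on a specific node
    kubectl describe node MY-NODE | grep -A 8 "Allocated resources"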
I had seen this error before when using Vertical Pod Autoscaling on the cluster. I have since disabled Vertical Pod Autoscaling and have completely deleted and recreated the node pool(s) several times. However, the error persists.
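For completeness, the VPA setting at the cluster level can be double-checked with (cluster name is a placeholder):

    # Prints False (or nothing) when vertical pod autoscaling is disabled
    gcloud container clusters describe MY-CLUSTER --region us-central1 \
      --format="value(verticalPodAutoscaling.enabled)"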
Is there any other setting that could be causing the scheduler to inflate the resource requests beyond the requested amount?
The cluster and nodes are currently running GKE version 1.29.6-gke.1254000.
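For reference, the current control plane and node versions can be confirmed with (cluster name is a placeholder):

    gcloud container clusters describe MY-CLUSTER --region us-central1 \
      --format="value(currentMasterVersion,currentNodeVersion)"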
I have changed the autoscaling profile of the cluster from 'Optimize Utilization' to 'Balanced' and that seems to have helped limit the occurrence of the issue. However, the problem still occurs under high workload volumes.
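For anyone wanting to try the same change, the profile can be switched with something like (cluster name is a placeholder):

    # Switch the cluster autoscaler profile from optimize-utilization to balanced
    gcloud container clusters update MY-CLUSTER --region us-central1 \
      --autoscaling-profile balanced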
And are you not seeing new nodes created via the Cluster Autoscaler?
Hi @garisingh, thank you very much for the question. The nodes are indeed scaling correctly. The issue actually occurs once the node pool reaches its maximum number of nodes. What I would expect, from my understanding, is that workloads would then remain in a Pending state, awaiting node resources to become available (from other pods completing) before being scheduled. But the scheduler seems to be inflating the required resources beyond the resource request property, causing the error.
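When this happens, the affected pods can be inspected with something like the following (namespace and pod names are placeholders); a rejected pod's description should show the same "Node didn't have enough resource" message:

    # Pods still waiting for capacity
    kubectl get pods -n MY-NAMESPACE --field-selector=status.phase=Pending
    # Status and events of a rejected pod, including the kubelet admission failure
    kubectl describe pod MY-POD -n MY-NAMESPACE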
The node pool is set up with the cluster autoscaler using the ANY location policy and per-zone limits (0-20) across four zones in us-central1 (a, b, c, f).
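That autoscaler configuration corresponds roughly to the following (names are placeholders, and flag availability can vary by gcloud version):

    gcloud container node-pools update MY-POOL --cluster MY-CLUSTER --region us-central1 \
      --enable-autoscaling --location-policy ANY --min-nodes 0 --max-nodes 20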
Perhaps the issue is with the cluster-level autoscaling profile of 'optimize utilization'? I encountered scheduling errors from inflated resource requests when trying out the Vertical Pod Autoscaler, but we have since disabled vertical pod autoscaling. We did recently upgrade the cluster from GKE 1.28 to 1.29.#, and the incidents of over-inflation did start to occur after that upgrade.
Update on Feb 28 2025
We performed an update of the GKE version to 1.32.1 on Feb 21, 2025.
This GKE version includes a fix for
https://github.com/kubernetes/kubernetes/issues/115325
This update appears to have resolved the issue.
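For anyone upgrading manually rather than waiting on the release channel, the commands look roughly like this (cluster name, pool name, and the exact target version are placeholders):

    # Upgrade the control plane first, then the node pool(s)
    gcloud container clusters upgrade MY-CLUSTER --region us-central1 --master --cluster-version VERSION
    gcloud container clusters upgrade MY-CLUSTER --region us-central1 --node-pool MY-POOL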
I still have not found a resolution to this issue. It is intermittent, and I have not been able to identify why it comes and goes. The cluster we are using has been in place for a number of years. Vertical pod autoscaling is turned off, we have created new node pools within the cluster, we have disabled and re-enabled autoscaling, and we have switched the autoscaling profile between 'optimize utilization' and 'balanced', yet we still get occurrences of
"Pod was rejected: Node didn't have enough resource: cpu, ..."
where the CPU amounts are above those in the pod definition's limits, indicating that something within the system is inflating the resource requests after the pod has been assigned to a node.
There is no indication of a VPA controller or any other such configuration on the cluster. Short of completely deleting the cluster and building a new one, we have tried every logical combination.
Has anyone successfully used the vertical pod autoscaler, or more importantly, successfully disabled it after encountering issues with pod rejections due to resource limit changes post scheduling?
Three quick suggestions:
1. Check that VPA is fully disabled at the cluster level (via the --enable-vertical-pod-autoscaling flag).
2. If I remember correctly, VPA applies labels/annotations to a pod when it modifies its resource requests. It would be worth double-checking the pending pods to see if any were applied.
3. Check that there are no VPA objects in the cluster at all. If none are present, then the modification of resource requests is not caused by VPA, as it uses the values stored in those objects to apply recommendations. (Example commands for all three checks are sketched below.)
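Something like this should work for the three checks (cluster/namespace/pod names are placeholders; the vpa resource type only exists if the VPA CRDs are installed):

    # 1. Cluster-level VPA setting
    gcloud container clusters describe MY-CLUSTER --region us-central1 \
      --format="value(verticalPodAutoscaling.enabled)"
    # 2. Annotations on an affected pod
    kubectl get pod MY-POD -n MY-NAMESPACE -o jsonpath='{.metadata.annotations}'
    # 3. Any VPA objects anywhere in the cluster
    kubectl get vpa --all-namespaces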
Hi @rwainman, thank you so much for your post, I greatly appreciate the reply.
You are correct on all three counts: there is no VPA active and there are no annotations.
I did some more tracing, and I believe the issue is actually related to this bug in Kubernetes itself, which looks like it may be resolved in version 1.32:
https://github.com/kubernetes/kubernetes/issues/115325
We will have to wait for a 1.32+ upgrade to become available on the GKE Regular channel before we can try and confirm that, though.
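In the meantime, the versions currently available in each release channel can be listed with (region is a placeholder):

    # Shows the default and available versions per GKE release channel
    gcloud container get-server-config --region us-central1 --format="yaml(channels)"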