GKE Autopilot 1.29 Warden seems to have a bug in its resource calculations

Hi,

We recently upgraded our cluster to 1.29 and immediately started seeing odd behavior that seems to be related to how Warden calculates resources.

As of today, we have seen multiple instances of what appears to be the same problem.

(The ephemeral storage requested is 10240Mi, i.e. exactly 10Gi.)

admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints. Violations details: {"[denied by autogke-pod-limit-constraints]":["Max ephemeral-storage requested by init containers for workload 'example-flow-martin-tvctq-start-2572180891' is higher than the Autopilot maximum of '10Gi'.","Total ephemeral-storage requested by containers for workload 'example-flow-martin-tvctq-start-2572180891' is higher than the Autopilot maximum of '10Gi'."]}
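
To illustrate, the rejected spec looks roughly like the sketch below. The names and images are placeholders, and I'm assuming the 10240 figure is in MiB, i.e. exactly the documented 10Gi Autopilot maximum:

```yaml
# Minimal sketch of the kind of spec being rejected (names/images are placeholders).
# Both the init container and the main container request 10240Mi of ephemeral
# storage, i.e. exactly the 10Gi Autopilot ceiling, yet 1.29 Warden reports
# both as "higher than" 10Gi.
apiVersion: v1
kind: Pod
metadata:
  name: example-flow-start          # placeholder name
spec:
  initContainers:
    - name: init
      image: busybox
      resources:
        requests:
          ephemeral-storage: 10240Mi   # exactly 10Gi
  containers:
    - name: main
      image: example/flow-runner:latest   # placeholder image
      resources:
        requests:
          ephemeral-storage: 10240Mi   # exactly 10Gi
```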

We have also had an instance where the requested memory was 4096 and we got this:

Violations details: {"[denied by autogke-pod-limit-constraints]":["The memory to cpu requests ratio for workload 't-2ee3e40d-5w6d6' is '4.0001953125' is outside of the allowed range for Autopilot of '4 - 4'."]}
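
For reference, a pod requesting 4096Mi of memory alongside 1 vCPU (the 1 vCPU figure and the Mi unit are my assumptions) has a memory-to-CPU ratio of exactly 4.0, which is the only value the quoted '4 - 4' range allows, yet Warden computes '4.0001953125':

```yaml
# Sketch of a pod whose requests give an exact 4.0 memory:cpu ratio
# (4Gi of memory per 1 vCPU). Names and image are placeholders; the
# 1 vCPU figure is an assumption. On 1.29, Warden rejects a spec like
# this with a computed ratio of 4.0001953125 against the '4 - 4' range.
apiVersion: v1
kind: Pod
metadata:
  name: ratio-example               # placeholder name
spec:
  containers:
    - name: worker
      image: example/worker:latest  # placeholder image
      resources:
        requests:
          cpu: "1"                  # 1000m
          memory: 4096Mi            # 4096Mi / 1 vCPU = ratio of exactly 4.0
```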

The exact same code was working perfectly yesterday on 1.28... but on 1.29 it seems the calculations are now inexact?


Hi @dels 

Welcome to Google Cloud Community!

Regarding your error "Max ephemeral-storage requested by init containers for workload is higher than the Autopilot maximum of '10Gi'": you may want to check this guide, since it seems your deployments reached the maximum cumulative value of storage requests across all containers.

The other error, "The memory to cpu requests ratio for workload is outside of the allowed range for Autopilot", can depend on the compute class you are using in your cluster.
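
As an illustration, a compute class is selected with a nodeSelector, and each class allows a different memory-to-CPU request ratio. The workload name and image below are placeholders:

```yaml
# Illustrative sketch of selecting an Autopilot compute class via a
# nodeSelector (workload name and image are placeholders). The allowed
# memory:cpu request ratio differs per compute class.
apiVersion: v1
kind: Pod
metadata:
  name: compute-class-example        # placeholder name
spec:
  nodeSelector:
    cloud.google.com/compute-class: Balanced   # built-in Autopilot compute class
  containers:
    - name: app
      image: example/app:latest      # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
```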

I hope this information is helpful.

If you need further assistance, you can always file a ticket with our support team.

Hi @RonEtch, I understand the "reasoning" behind the information you provided, but can you please explain why the behavior changed in 1.29 if it isn't a bug?

We were able to work around the problems by changing those configurations and using different compute classes, but can you please explain how the exact same workloads, with the exact same resource allocations, spun up just fine on 1.28.x?


Are you still seeing the same errors now, or did they only occur during the upgrade?

Yes, the upgrade process itself went fine, but as I explained, 1.29 is not behaving like 1.28: it has been consistently erroring since the upgrade with the same default resource requests that worked fine on 1.28.

It is hard to say whether this is definitely a bug, since there are other reasons that could lead to the errors you received. I recommend filing a ticket with our support team to investigate the issue further.

We're also running into similar weird issues. For the K8s init container, Warden somehow sums the storage the init container requested itself (1GiB) with what we requested for the main container (10GiB), and then rejects the pod because 11Gi > 10Gi. We didn't see this on GKE < 1.29.
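
Roughly, the shape of the spec being rejected is the following (names and images are placeholders):

```yaml
# Rough sketch of the rejected spec (names and images are placeholders).
# The init container requests 1Gi and the main container 10Gi of ephemeral
# storage; on 1.29, Warden appears to add the two together (11Gi) and
# reject the pod, which 1.28 did not do.
apiVersion: v1
kind: Pod
metadata:
  name: init-storage-example        # placeholder name
spec:
  initContainers:
    - name: fetch-assets
      image: busybox
      resources:
        requests:
          ephemeral-storage: 1Gi
  containers:
    - name: app
      image: example/app:latest     # placeholder image
      resources:
        requests:
          ephemeral-storage: 10Gi
```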

Hi guys, we're experiencing this issue on 1.27 clusters. Since last week we've been encountering the same problem, whereas previously we were getting the message `Adjusted resources to meet`. Have there been any changes in the API?
