
Pod bursting still not available in my Autopilot cluster after upgrading

Hi,

My GKE Autopilot cluster was created on version `1.27.3-gke.100` and has been updated to `1.30.2-gke.1587003`, which is supposed to have pod bursting re-enabled according to https://cloud.google.com/kubernetes-engine/docs/how-to/pod-bursting-gke#availability-in-gke.

BTW, all worker nodes are on version v1.30.2-gke.1587003 too.
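For reference, a minimal way to double-check both versions (`CLUSTER_NAME` and `REGION` are placeholders):

```
# Control plane and node pool versions as reported by GKE.
gcloud container clusters describe CLUSTER_NAME --region REGION \
  --format='value(currentMasterVersion,currentNodeVersion)'

# Kubelet version reported by each node.
kubectl get nodes
```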

However, it seems like pods are still in the Guaranteed QoS class, even the test pod from the doc.

```
# kdp = kubectl describe pod (shell alias)
kdp helloweb-5b78557f66-s45gc | grep QoS
QoS Class: Guaranteed
```

Can someone help me figure out what's going on there? Thanks

1 ACCEPTED SOLUTION

The "Limitations" section that I linked to has the instructions: basically, you need to `gcloud container clusters upgrade --master` the cluster to the same GKE version that it's already on, which will trigger a control plane restart 🙂



Can you share your deployment spec?

Sure. I just use the example in the doc: https://cloud.google.com/kubernetes-engine/docs/how-to/pod-bursting-gke#deploy-burstable-workload

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloweb
  labels:
    app: hello
spec:
  selector:
    matchLabels:
      app: hello
      tier: web
  template:
    metadata:
      labels:
        app: hello
        tier: web
    spec:
      nodeSelector:
        pod-type: "non-critical"
      tolerations:
      - key: pod-type
        operator: Equal
        value: "non-critical"
        effect: NoSchedule
      containers:
      - name: hello-app
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 350m
```

+1, we're having the exact same issue; we ran the same tests and got the same results.

I'm having the same problem with v1.30.2-gke.1587003.

My cluster was created on v1.29.6-gke.1326000, then upgraded to v1.30.2-gke.1587003.
The node version is also v1.30.2-gke.1587003.

However, after following the documentation, the QoS class for the helloweb pod still turns out to be "Guaranteed".

@Simelvia @jastes @yanqiang in the Limitations section of the doc, there are instructions to manually restart the control plane, which must happen after your nodes all run a supported version. Could you confirm whether you've manually restarted the control plane after the version upgrade completed on your nodes? Just to check, could you try doing that once more and redeploy the Pod to see if that works?

 

Hi @shannduin, the doc doesn't mention how to actually trigger the manual restart. It only mentions `kubectl get nodes`, which I ran, and all nodes are on the right version.

The "Limitations" section that I linked to has the instructions: basically, you need to `gcloud container clusters upgrade --master` the cluster to the same GKE version that it's already on, which will trigger a control plane restart 🙂
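For anyone else landing here, a minimal sketch of that no-op upgrade (`CLUSTER_NAME`, `REGION`, and the version string are placeholders; use the version your cluster already runs):

```
# "Upgrade" the control plane to the version it is already on; GKE accepts this
# and recreates the control plane, which is what re-enables bursting.
gcloud container clusters upgrade CLUSTER_NAME --region REGION \
  --master --cluster-version 1.30.2-gke.1587003
```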

Thanks. I've upgraded the k8s cluster to an even newer version, and I guess that restarted the control plane. Now pod bursting is working. Thanks!

Thank you for your advice, @shannduin, my deployment is Burstable now. But there is still a problem with the reason we needed bursting in the first place: we wanted to be able to allocate smaller resources to our multiple micro-deployments, and that still seems not possible. I applied the exact same file described in the docs for a sample burstable workload, but specified smaller resources:

```
requests:
    cpu: 25m
    memory: 128Mi
limits:
    cpu: 50m
    memory: 256Mi
```

Nevertheless, Autopilot automatically adjusts them to much larger values:

```
autopilot.gke.io/resource-adjustment: '{"input":{"containers":[{"limits":{"cpu":"50m","ephemeral-storage":"1Gi"},"requests":{"cpu":"25m","ephemeral-storage":"1Gi","memory":"512Mi"},"name":"hello-app"}]},"output":{"containers":[{"limits":{"cpu":"500m","ephemeral-storage":"1Gi"},"requests":{"cpu":"500m","ephemeral-storage":"1Gi","memory":"512Mi"},"name":"hello-app"}]},"modified":true}'
```

Why does that happen, and how do I overcome it? Thank you in advance.

Maybe specifying requests of at least 50m CPU might help. As specified in "Resource requests in Autopilot" (Minimum and maximum resource requests), 50m CPU and 52MiB memory are the minimum requests for the general-purpose compute class.

Following shannduin's instructions, I was able to request 50m CPU & 52MiB memory:
1. Upgrade the Autopilot cluster.
2. The nodes will be auto-upgraded.
3. Do step 1 again to manually restart the control plane (see the sketch below).
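As a hedged follow-up (not part of the original reply), one way to confirm the small requests were kept after the restart, assuming the sample helloweb Deployment with the `app: hello` label:

```
# The resource-adjustment annotation should now report "modified":false,
# and the Pod should land in the Burstable QoS class.
kubectl get pods -l app=hello -o yaml | grep resource-adjustment
kubectl describe pods -l app=hello | grep "QoS Class"
```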

Yup, this is correct

This is super annoying, we've been at it for a couple of days now.

Issue:
On deploying a burstable Pod, we get the Autopilot resource-adjustment warning and the CPU & memory limits are not respected.

```
QoS Class:  Burstable
```



Our node version: v1.30.3-gke.1639000
Initial version: 1.29.7-gke.1008000
Release channel: Rapid

Answers:
Yes, we manually restarted the control plane after the upgrade to the latest node version, as suggested by @shannduin.

We're using Google's pod example to test: https://cloud.google.com/kubernetes-engine/docs/how-to/pod-bursting-gke

What can we do to resolve this?

Any updates here?

There's a solution in this post 

The one with the control plane restart that you shared? That didn't work for me. Can you point me to the solution you're referring to?

What do you mean by the CPU and memory limits aren't respected? Did it adjust your limits to be equal to the requests? Could you post the modified manifest?

Yes, as soon as I deploy the Pod (copied from the URL), I get an Autopilot mutator warning that the CPU resources have been adjusted to meet minimum requirements.

Here's the pod it creates: https://gist.github.com/thesrs02/b4ebbce82340d82b140db2595bf3b840


Hey, I gave this a go and confirmed. If I manually adjust the request to `500m` and set the limit to a higher value like `750m`, it works as expected. I'll check if there's an explanation and get back to you.

I'm trying to set it to 50m or 250m, not 500m. I know it won't throw a warning on 500m.

I get that; I still needed to check to be sure.

It seems to have resolved on its own for some reason. Quick question: by default, will every pod be in the Burstable class?


Only if your limits are different from your requests. If you explicitly set requests == limits, the Pod will be Guaranteed QoS.

https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#resource-limits
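A hedged illustration of that rule (the values are hypothetical, not taken from this thread):

```
# Burstable: limits differ from requests, e.g.
#   resources:
#     requests: { cpu: 250m }
#     limits:   { cpu: 500m }
#
# Guaranteed: CPU and memory requests == limits for every container, e.g.
#   resources:
#     requests: { cpu: 250m, memory: 128Mi }
#     limits:   { cpu: 250m, memory: 128Mi }

# Check which class Kubernetes assigned:
kubectl get pods -l app=hello -o jsonpath='{.items[*].status.qosClass}'
```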

But it's weird that it resolved itself. Did you do anything differently from the first time?

No, nothing at all. Was waiting for updates here.

@rehan2 so the defaulting happened because the workload in the doc had a nodeSelector and a toleration, which means that it uses workload separation. Autopilot enforces higher minimums (500m CPU) for workload separation (see https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#workload-separa...). So it was working as intended. We'll update the doc to remove that from the manifest, since the example Pod's requests are <500m CPU.
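A hedged sketch of that workaround (the file name is hypothetical): remove the workload-separation fields from the sample manifest, redeploy, and the sub-500m requests should then be accepted.

```
# Delete these fields from the Deployment's Pod spec before applying:
#   nodeSelector:
#     pod-type: "non-critical"
#   tolerations:
#   - key: pod-type
#     operator: Equal
#     value: "non-critical"
#     effect: NoSchedule
kubectl apply -f helloweb-no-separation.yaml   # hypothetical edited copy of the sample
kubectl describe pods -l app=hello | grep "QoS Class"
```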

@rehan2 just closing it off here: updated https://cloud.google.com/kubernetes-engine/docs/how-to/pod-bursting-gke#deploy-burstable-workload so that the manifest doesn't use workload separation.

 
