
How do I remove a bad node in an Autopilot cluster? "The node was low on resource: ephemeral-storage"

Cluster version: v1.30.9-gke.1127000

For no reason we can identify, starting 2-3 days ago, one specific node that's been in my Autopilot cluster for 35 days keeps evicting containers with the following message: "The node was low on resource: ephemeral-storage. Threshold quantity: 10120387530, available: 9468104Ki." The container it keeps evicting requests `ephemeral-storage: 5G`, and only one of them is ever scheduled onto the node, so it should fit comfortably within the roughly 10 GB threshold.
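For reference, a couple of read-only checks along these lines should show what the node actually reports for ephemeral storage (NODE_NAME is a placeholder for the bad node, and the second command assumes Autopilot lets you reach the node proxy):

# Allocatable/capacity for ephemeral-storage and the DiskPressure condition
kubectl describe node NODE_NAME

# Live filesystem usage as reported by the kubelet summary API (see the node-level "fs" section)
kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/stats/summary"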

I can't get rid of the bad node because `kubectl delete node ...` is blocked by the warden: "Operation on nodes is not allowed in Autopilot." That makes sense, as Autopilot is meant to automatically provision nodes to match our requested resources. However, our requested resources haven't changed in the past several days, or even in the last 2 months.

There are no other pods from the namespace we manage on the node. The only thing that ever lands there is one of our `ephemeral-storage: 5G` pods when Autopilot schedules it onto that node; it inevitably gets evicted, and the scheduler keeps retrying until the pod eventually lands on a good node that can run it. The cycle keeps repeating because, as far as Autopilot is concerned, this one node is the best fit for the resources defined in our pods.
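For anyone wanting to reproduce the picture, a command along these lines lists everything currently scheduled onto the node (NODE_NAME is a placeholder):

# All pods on the node, across namespaces, including GKE system pods
kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME -o wide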

How can I get rid of this bad node? The only recent change I can think of is we've added a ValidatingAdmissionPolicy to the cluster for the first time. Would that have anything to do with ephemeral-storage?
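In case the new policy were somehow involved, it's easy to list the policies and their bindings and inspect what they match (POLICY_NAME is a placeholder):

# List ValidatingAdmissionPolicy objects and their bindings (GA in Kubernetes 1.30)
kubectl get validatingadmissionpolicies,validatingadmissionpolicybindings

# Inspect a single policy's matchConstraints and validations
kubectl get validatingadmissionpolicy POLICY_NAME -o yaml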

Perhaps interestingly, the node has a failing image-package-extractor container (in the kube-system namespace), evicted with the following message: "The node was low on resource: ephemeral-storage. Threshold quantity: 10120387530, available: 9760444Ki. Container image-package-extractor was using 84Ki, request is 0, has larger consumption of ephemeral-storage."

Does that give more information about this issue, or is that just another symptom rather than the cause? The image-package-extractor containers on the other nodes are running just fine.
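For completeness, evicted pods stick around as Failed objects, so something like the following should surface them along with the eviction events (NODE_NAME is a placeholder):

# Pods in kube-system still parked on the node (evicted ones show as Failed/Evicted)
kubectl get pods -n kube-system --field-selector spec.nodeName=NODE_NAME

# Recent eviction events, newest last
kubectl get events -n kube-system --field-selector reason=Evicted --sort-by=.lastTimestamp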

We are running two clusters on that same GKE version, on separate projects in the same region under the same organisation, but this bug is only occurring in one of the clusters, on one of the nodes. We're worried that it'll spread to our other cluster and cause downtime in our systems.

Is there anything else I can investigate to figure this out? I'd love to just get rid of the node and have Autopilot provision a new one, but it won't let me 😭

Solved
1 ACCEPTED SOLUTION

Not sure about the root issue at this point, but you can remove a problematic node in Autopilot mode by draining it:

 

kubectl drain NODE_NAME --ignore-daemonsets

 
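If the drain gets stuck on pods that use emptyDir volumes, adding --delete-emptydir-data should let it proceed, and afterwards the node should show up as cordoned (NODE_NAME is a placeholder, as above):

kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# STATUS should now include SchedulingDisabled
kubectl get node NODE_NAME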


3 REPLIES


Ah thanks! That's put it into SchedulingDisabled status, but it hasn't evicted any of the GKE-managed pods. That effectively solves the issue, since no new pods of ours will be scheduled onto it 😃 but the node is still sitting there, so I don't know when Autopilot will delete it 😞

It eventually got deleted when the GKE version got updated and all nodes were replaced. I'm unsure if it would've been deleted otherwise, but it was effectively disabled so that was good enough 😃
