How to remove an unhealthy node in Google Kubernetes Engine Autopilot?

My Kubernetes cluster running on GKE Autopilot has had an unhealthy node for a few days. The node has a `Ready` status, but all the pods running on it have a `CreateContainerError` status and seem to be stuck pulling container images.

Example:
> Normal  Pulled  2m4s (x26987 over 4d1h)  kubelet  Container image "gke.gcr.io/cluster-proportional-autoscaler:v1.8.10-gke.3@sha256:274afbfd520aef0933f1fefabddbb33144700982965f9e3632caabb055e912c6" already present on machine
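
For reference, the stuck pods can be listed and inspected with something like the following (the node and pod names below are placeholders):

```sh
# List every pod scheduled on the unhealthy node (node name is a placeholder)
kubectl get pods --all-namespaces --field-selector spec.nodeName=gk3-my-cluster-default-pool-abc123

# Show the events of one stuck pod, including the repeated "Pulled" messages
kubectl describe pod <stuck-pod-name> -n kube-system
```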

Something went wrong with the node. I suspect it's because I upgraded Kubernetes and my account ran out of SSD quota during the upgrade. I have since obtained more quota, new nodes were created, and the upgrade completed. It could be unrelated too.

I did "cordon" the node to mark it unschedulable, and manually deleted my pods from it. New pods got scheduled on healthier nodes, so not too bad and I could live with one broken node.

But I want to clean up. The old pods I deleted were stuck in a `Terminating` state, but force deleting them made them disappear.
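
The force delete was along these lines (pod and namespace names are placeholders):

```sh
# Skip the graceful termination wait for a pod stuck in Terminating
kubectl delete pod my-app-7d9f8b6c4-xyz12 -n my-namespace --grace-period=0 --force
```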

I cannot do the same in the `kube-system` and `gke-gmp-system` namespaces. I see the "managed" pods with a `CreateContainerError` status, and they are pulling container images in a loop. One is also stuck in a `Terminating` status.

I would like to remove this node, so I drained it as the documentation describes. But a few days later, it's still there.
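
The drain command I used was something like this (node name is a placeholder):

```sh
# Evict the remaining pods and keep the node cordoned
kubectl drain gk3-my-cluster-default-pool-abc123 --ignore-daemonsets --delete-emptydir-data
```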

How could I remove the unhealthy node?

ACCEPTED SOLUTION

A new Kubernetes version eventually arrived, and upgrading the cluster took care of the unhealthy node.

Perhaps downgrading and upgrading the cluster could have worked too.


6 REPLIES

Hi,

Welcome to Google Cloud Community!

The `kube-system` and `gke-gmp-system` namespaces are managed by Google. As mentioned in the GKE documentation, GKE manages all of the workloads in these namespaces. It is safe to ignore the errors when draining these namespaces, and note that GKE might also rate-limit your use of the `kubectl drain` command.
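
For example, you can confirm which of the remaining pods on the node belong to those managed namespaces (the node name below is a placeholder); eviction errors for these pods during a drain can be ignored:

```sh
# Pods in these namespaces are managed by GKE
kubectl get pods -n kube-system --field-selector spec.nodeName=gk3-my-cluster-default-pool-abc123
kubectl get pods -n gke-gmp-system --field-selector spec.nodeName=gk3-my-cluster-default-pool-abc123
```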

I hope this information is helpful.

If you need further assistance, you can always file a ticket with our support team.

Thanks for your reply. Do you know how I could remove the unhealthy node?

If the node's still cordoned, could you try uncordoning it and re-running the drain command? 

> Violations details: {"[denied by autogke-no-node-updates]":["Uncordon on nodes is not allowed in Autopilot."]}

It looks like uncordoning is not possible, unfortunately.

A new Kubernetes version eventually arrived, and upgrading the cluster took care of the unhealthy node.

Perhaps downgrading and upgrading the cluster could have worked too.

Nice! Maybe the solution then is to perform an upgrade to the same version that your cluster already runs (so you're not changing the version, just doing an in-place upgrade to trigger node re-creation).
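
I haven't tried this on Autopilot myself, but in gcloud terms it might look roughly like the sketch below (the cluster name and region are placeholders, and GKE may reject a request for the exact version the cluster already runs depending on the release channel):

```sh
# Cluster name and region are hypothetical placeholders
CLUSTER=my-autopilot-cluster
REGION=us-central1

# Look up the control-plane version the cluster currently runs
VERSION=$(gcloud container clusters describe "$CLUSTER" --region="$REGION" \
  --format="value(currentMasterVersion)")

# Request a control-plane upgrade to that same version (may or may not trigger node re-creation)
gcloud container clusters upgrade "$CLUSTER" --region="$REGION" --master --cluster-version="$VERSION"
```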
