
kube-system pods overuse node CPU

This is the second time I have run into the following issue. One of the GKE nodes becomes unhealthy (user pods' health checks often fail, as do other calls to them). The root cause seems to be that kube-system pods consume all available CPU (> 5 CPU cores on an 8-core node; CPU usage stays at ~100% throughout). This can last for days and does not resolve itself. The only solution I've found is to re-create the node. Please advise on the possible root cause.

Node k8s version: v1.25.12-gke.500


Do you know which specific pod inside the kube-system namespace is using all the available CPU?

Also, which version of GKE are you on?

@abdelfettah thank you for the fast reaction.

This is a screenshot from k9s, taken while the problem was occurring.

[Screenshot: sergii24_0-1710316496210.png]

The CPU usage values here don't add up to ~8 cores, although overall usage was at that level (I guess that's a known issue). Even so, you can see that kube-system pods are using more than 5 CPU cores.
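For anyone who wants to check the per-namespace total without adding up a k9s view by hand, here is a minimal Python sketch. It assumes the text comes from `kubectl top pods --all-namespaces` (columns: namespace, pod name, CPU, memory). The function names and the sample rows are made up for illustration and are not taken from the affected cluster. Note that the command covers the whole cluster, so the totals are not limited to a single node.

```python
from collections import defaultdict
from typing import Dict


def parse_cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as '1500m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def cpu_by_namespace(top_output: str) -> Dict[str, int]:
    """Sum CPU usage (in millicores) per namespace from `kubectl top pods` text."""
    totals: Dict[str, int] = defaultdict(int)
    for line in top_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAMESPACE":
            continue
        totals[fields[0]] += parse_cpu_millicores(fields[2])
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample rows, only to show the expected input layout.
    sample = """NAMESPACE     NAME        CPU(cores)   MEMORY(bytes)
kube-system   pod-a       1500m        100Mi
kube-system   pod-b       2            50Mi
default       pod-c       250m         10Mi
"""
    totals = cpu_by_namespace(sample)
    for namespace, millicores in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{namespace}: {millicores / 1000:.2f} cores")
```

In practice you would pass in the real command output instead of the sample string, for example by reading it from a file or from standard input.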

Hmm, that does look odd. I tried to replicate it but could not. My best advice would be to reach out to support and to look at the logs of these pods to understand what they are doing.

@abdelfettah it's not easy to reproduce. Across 3 similar nodes over ~3 months, this happened only twice. So, on average, a single node would take 4-5 months to hit the issue.

To anyone who stumbles upon this problem and is looking for a root cause and a solution: it seems to be related to TCP OOM. This is the output from `dmesg`: `TCP: out of memory -- consider tuning tcp_mem`.

It looks like the issue occurs when the node is running near its RAM limit.

So consider adding RAM to the node or, if you cannot, check the solutions in this article: https://cloud.google.com/compute/docs/networking/tcp-optimization-for-network-performance-in-gcp-and...
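If you want to check how close a node is to the TCP memory limit before the `dmesg` message shows up, here is a rough Python sketch. It assumes a Linux node where `/proc/net/sockstat` reports the TCP memory in use (the `mem` field, in pages) and `/proc/sys/net/ipv4/tcp_mem` holds the three thresholds (low, pressure, high, also in pages). The function names and the state labels are my own, not something from GKE tooling.

```python
from pathlib import Path
from typing import Tuple

SOCKSTAT_PATH = Path("/proc/net/sockstat")
TCP_MEM_PATH = Path("/proc/sys/net/ipv4/tcp_mem")


def parse_tcp_mem_limits(text: str) -> Tuple[int, int, int]:
    """Parse the three tcp_mem thresholds (low, pressure, high), in pages."""
    low, pressure, high = (int(part) for part in text.split())
    return low, pressure, high


def parse_sockstat_tcp_pages(text: str) -> int:
    """Return the 'mem' value (pages in use) from the TCP line of sockstat."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()
            return int(fields[fields.index("mem") + 1])
    raise ValueError("no TCP line found in sockstat text")


def tcp_memory_state(used_pages: int, limits: Tuple[int, int, int]) -> str:
    """Classify current usage against the tcp_mem thresholds."""
    _low, pressure, high = limits
    if used_pages >= high:
        return "over high limit"
    if used_pages >= pressure:
        return "under pressure"
    return "ok"


if __name__ == "__main__":
    if SOCKSTAT_PATH.exists() and TCP_MEM_PATH.exists():
        limits = parse_tcp_mem_limits(TCP_MEM_PATH.read_text())
        used = parse_sockstat_tcp_pages(SOCKSTAT_PATH.read_text())
        print(f"TCP pages in use: {used}, tcp_mem limits: {limits}")
        print(f"state: {tcp_memory_state(used, limits)}")
    else:
        print("procfs TCP files not found; run this on the Linux node itself")
```

Run it on the node itself (for example over SSH), since the values are per node and reflect the node's network namespace.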
