I have been working on this issue since last week and would really appreciate some support.
The issue started after upgrading the cluster/nodes from 1.21 to 1.25.9-gke.2300. Since then the application has been going down several times a day, and at the same time I have noticed that the HPA does not show limits/resources from the metrics server (metrics-server-v0.5.2).
After running kubectl get hpa, the values for all pods show as "unknown".
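For context, the health of the metrics pipeline itself can be checked with commands like these (the deployment name metrics-server-v0.5.2 matches what our GKE cluster reports; adjust it if yours differs):

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl -n kube-system get pods -l k8s-app=metrics-server
kubectl -n kube-system logs deployment/metrics-server-v0.5.2
kubectl top nodes

If the apiservice reports unavailable or kubectl top fails, that would explain the "unknown" values in the HPA.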
At the same time we can see all pods restarting, and after a few minutes everything goes back to normal.
The cluster autoscaler has a limit of 40 nodes; during the day it scales up to 20~30 nodes, and at night it scales down to 4 nodes.
Instances are N1 custom machines with 8 vCPUs and 16 GB RAM.
Requests/limits are set on all pods, and GKE Workloads at Risk (Cloud Monitoring) shows no workloads at risk.
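To double-check that, the requests set on each container can be listed with something like this (the column spec is just an example):

kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,REQUESTS:.spec.containers[*].resources.requests'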
We are using a limit of 70 pods per node, but we only see 20~25 pods per node.
Thanks for any suggestions!
While working on the issue, I found that one of the nodes (not the one where the metrics server is running) has two events:
At the same time I noticed that new pods were assigned to the node and were using a lot of network traffic...
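The node events and the pods scheduled onto that node can be inspected with something like this (<node-name> is a placeholder):

kubectl describe node <node-name>
kubectl get events -A --field-selector involvedObject.name=<node-name> --sort-by=.lastTimestamp
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>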
The pods are using Alpine 3.15.4, and after some research I noticed that this release has some issues with k8s v1.25.
We are planning to change Alpine to 3.17-stable.
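Once the images are rebuilt on the 3.17 base, the rollout would be something like this (deployment and image names below are placeholders, not our real ones):

kubectl set image deployment/my-app my-app=registry.example.com/my-app:alpine-3.17
kubectl rollout status deployment/my-app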
We are still working on this issue.