
Calico-node can't start up due to insufficient resources on the node and blocks other pods from terminating

We have seen issues where calico-node can't start up because there aren't enough resources available on the node. Kubernetes tries to evict some pods to make room for calico, but because calico is down those pods get stuck in Terminating and can't be killed until you manually force-delete them; only then does calico-node have enough resources to start.

  Warning  FailedKillPod  2m2s (x138 over 31m)  kubelet  error killing pod: failed to "KillPodSandbox" for "bc24ce80-ad39-440d-9261-3b19542ef29c" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"multiple-round-service-74ddb6d49d-lc7fj_default\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"

It's really hard to reproduce the problem and put calico in this state. Any ideas on how to fix this properly? I was thinking of writing a script to clean this up, but this is happening more often now and it doesn't feel right.
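For reference, a rough sketch of what such a cleanup script could look like (assuming the stuck pods are safe to force-delete; the filter and flags are just examples):

  # Find pods stuck in Terminating and force-delete them so that
  # calico-node can reclaim the freed resources on the node.
  kubectl get pods -A --no-headers | grep Terminating | \
    while read ns pod rest; do
      kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
    done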

It looks like all calico-node pods are restarted/re-deployed at the same time when the issue occurs, but we don't hit the issue every time or on every node.

4 REPLIES

We started encountering the issue after we migrated from v1.21 to v1.22. This issue tracker entry seems related: https://issuetracker.google.com/issues/239154504.

Thanks, I did not know about the issue tracker site. I also found this one, which describes the same problem: https://issuetracker.google.com/issues/237566158

We have the same issue in production after we migrated from v1.21 to v1.22. Restarting a calico-node pod or the kubelet doesn't help. The only way to fix it is to decrease cpu.requests for the calico-node DaemonSet.
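For example, something along these lines (100m and the container name are just illustrative; note that the vertical autoscaler can revert a direct edit, see the next reply):

  # Lower the CPU request of the calico-node container directly on the DaemonSet.
  kubectl -n kube-system set resources daemonset calico-node \
    -c calico-node --requests=cpu=100m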

Same issue under the same conditions, 1.21 -> 1.22. We are testing this workaround:

Change the calico-node-vertical-autoscaler ConfigMap to use small values for calico-node.requests.cpu.base and calico-node.requests.cpu.max, to avoid another rolling update and to make sure the pod starts correctly on every node.

The calico-node-vertical-autoscaler deployment must be restarted after editing the ConfigMap.
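Roughly, the steps look like this (the exact keys and format inside the ConfigMap can differ between cluster versions, so check yours before applying):

  # Edit the autoscaler ConfigMap and lower calico-node.requests.cpu.base
  # and calico-node.requests.cpu.max to small values.
  kubectl -n kube-system edit configmap calico-node-vertical-autoscaler

  # Restart the autoscaler so it picks up the edited ConfigMap.
  kubectl -n kube-system rollout restart deployment calico-node-vertical-autoscaler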
