Hi, I have a Kubernetes cluster on GCP running in Autopilot mode. It has already had multiple services running for some time. Now I am trying to add 3 new services, but I am unable to deploy them. I always get this error:
```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
```
The thing is, this is random: say I delete one service and then rescale another service, the error might go away, but it then transfers to another container that is trying to come up. Can someone please help me with this? What can I do here?
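In case it helps, this is roughly how I'm looking at the failing pods (pod and namespace names below are just placeholders for mine):

```
# Describe a stuck pod to see the FailedCreatePodSandbox event with the seccomp message
# (hypothetical pod/namespace names).
kubectl describe pod my-new-service-xxxxx -n my-namespace

# Or list recent events in the namespace, oldest first.
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp
```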
Hello @ankushjangid ,
Regarding your concern, I found this article by PjoterS, in which they point out that it might be Flannel causing the issue.
Flannel runs a small, single binary agent called flanneld on each host, and is responsible for allocating a subnet lease to each host out of a larger, preconfigured address space.
As a proposed solution, they wrote: "You have to make the Flannel pods work correctly on each node."
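A rough way to check that, assuming Flannel is deployed as a DaemonSet in kube-system with the label app=flannel (adjust the label and pod name to your setup):

```
# List the flanneld pods together with the node each one runs on.
kubectl get pods -n kube-system -l app=flannel -o wide

# Check the recent logs of one of them for subnet lease / allocation errors
# (pod name is a placeholder).
kubectl logs -n kube-system kube-flannel-ds-xxxxx --tail=50
```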
Hi
I am experiencing the exact same thing. I'm not using Autopilot, and I am using Istio with the CNI plugin.
Workloads run fine when nodes are restarted, and then after a couple of days this starts happening.
Have you had any luck in solving this?
Which GKE version are you running?
Hi, I'm running:
1.29.1-gke.1589018 with Dataplane V2.
I'm also experiencing the same thing. When I try to create new Deployments, the pods get stuck in the `Init:0/1` status. It happens intermittently with no obvious root cause. The pod events show this error:
```
FailedCreatePodSandBox pod/xrdm-portal-6bfb74964d-kpqpc Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
```
I'm running on GKE version 1.27.8-gke.1067004 on the regular release channel, with autoscaler on.
It's as if the autoscaler doesn't scale up as it should. I've noticed that I tend to receive this error more often when the cluster has only 3 nodes. However, when it decides to scale to 4, I don't see this issue as much. I've also noticed that the pods resume normal operation and get "unstuck" when I delete some other Deployments.
I'm happy to provide more info if needed.
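For context, this is roughly how I count pods per node and compare against the allocatable limit when the error shows up:

```
# Count pods per node; one node sitting near its limit tends to be the one with stuck pods.
kubectl get pods -A -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c | sort -rn

# Show each node's allocatable pod count for comparison.
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods
```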
I rebuilt my node pool, and the problem went away. I noticed that the pods that were "stuck" and gave this error were landing on the same node. Once I rebuilt the node pool, I no longer had this issue.
Yes, that is expected. Every time I restart my node pool it works again but starts failing after a couple of days.
I'll keep an eye on my node pool to see if it happens again. However, the problem for me seemed to be specific to one node. Is there a difference between restarting and rebuilding? I rebuilt my node pools, meaning that I destroyed them and created new ones.
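For what it's worth, "rebuilding" in my case just meant deleting the pool and creating a fresh one, roughly like this (cluster name, zone, pool names, and machine type are placeholders for my setup):

```
# Delete the affected node pool; its workloads get rescheduled onto the remaining pools.
gcloud container node-pools delete old-pool --cluster=my-cluster --zone=us-central1-a

# Create a replacement pool with the same shape.
gcloud container node-pools create new-pool --cluster=my-cluster --zone=us-central1-a \
  --machine-type=e2-standard-4 --num-nodes=3
```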
I spoke too soon! This issue started appearing again on another node. Here is the error I observed in Log Viewer:
```
E0401 12:49:42.096860 2007 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"konnectivity-agent-79db94fb79-cftdt_kube-system(871461e8-c40b-4876-8a85-4e17eb56369c)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"konnectivity-agent-79db94fb79-cftdt_kube-system(871461e8-c40b-4876-8a85-4e17eb56369c)\\\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown\"" pod="kube-system/konnectivity-agent-79db94fb79-cftdt" podUID=871461e8-c40b-4876-8a85-4e17eb56369c
```
Pods would stay stuck in the `Init:0/1` status with no other indication of what may be going on.
I'm going to try upgrading my GKE cluster to 1.28.7 to see if that does anything.
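If anyone wants to try the same, I'm planning to run something like this (cluster name, location, and the exact patch suffix are placeholders):

```
# Upgrade the control plane first.
gcloud container clusters upgrade my-cluster --zone=us-central1-a \
  --master --cluster-version=1.28.7-gke.PATCH

# Then bring the node pool up to the same version.
gcloud container clusters upgrade my-cluster --zone=us-central1-a --node-pool=default-pool
```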
I just had this happen to me again today 😕
Hi all,
I am having the same problem on a GKE Autopilot cluster, version 1.31.1-gke.2105000. I noticed that all pods having this issue were allocated to the same node, and draining it allowed the pods to start with no problem. I can't understand what caused this, especially since some other pods allocated to this "faulty" node still started smoothly.
Any help on how to fix this is appreciated.
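For reference, the drain workaround I used looks like this (node name is a placeholder):

```
# Stop new pods from being scheduled onto the suspect node (hypothetical node name).
kubectl cordon gk3-my-cluster-pool-1-faulty-node

# Evict the existing pods so they reschedule onto other nodes.
kubectl drain gk3-my-cluster-pool-1-faulty-node --ignore-daemonsets --delete-emptydir-data
```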
The same is happening on non-Autopilot GKE 1.28 and 1.30 (v1.30.5-gke.1443001).
I tried Ubuntu and COS nodes, as another forum suggested that there is a fix: https://github.com/torvalds/linux/commit/a1140cb215fa13dcec06d12ba0c3ee105633b7c4
But that patch is already in my current '5.15.0-1067-gke' Ubuntu nodes, yet the same thing is still happening.
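(The running kernel version is easy to confirm per node without SSH:)

```
# KERNEL column comes from each node's reported nodeInfo.
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
```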
I'm managing it by keeping my nodes not too big and limiting the maximum number of pods allowed on one node (most of the time the error happens on reaching 60-100 pods per node).
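The pods-per-node cap can only be set when a node pool is created, so I put it on a new pool, roughly like this (names, zone, and the exact limit are my own choices, not a recommendation):

```
# Cap the pool well below the 60-100 pods/node range where the error tends to appear.
gcloud container node-pools create capped-pool --cluster=my-cluster --zone=us-central1-a \
  --max-pods-per-node=48 --num-nodes=3
```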
I also increased bpf_jit_limit as a temporary fix, but the memory leak will reach that limit later anyway.
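For the bpf_jit_limit workaround, this is roughly what I do on each node over SSH (the value is arbitrary, it resets when the node reboots, and the leak eventually catches up to it anyway):

```
# On the node itself (e.g. via `gcloud compute ssh <node-name>`):
# current limit for JIT-compiled BPF (including seccomp filters), in bytes.
cat /proc/sys/net/core/bpf_jit_limit

# Raise it; seccomp filters start failing with errno 524 once JIT allocations hit the limit.
sudo sysctl -w net.core.bpf_jit_limit=1000000000
```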