Hello all,
I am having an issue with a GKE pod and I'm trying to diagnose it.
The pod in question serves as the main entry point for our websites.
The issue is that the pod on one of our sites keeps randomly restarting. Sometimes it'll restart once a day and sometimes several times in a single hour. There doesn't seem to be any rhyme or reason to it.
Each time it does, it brings our website offline for 10 - 20 seconds while a new one starts up.
There's nothing in the logs, and when I run `kubectl get events`, all it says is that the readiness and liveness probes failed. As such, I don't know how to find out what is going on.
When I look at the CPU/Memory of the pod, there are often little to no changes at the times of restart, and there's plenty of resources.
Also, as stated above, this is only on one of our sites. Other sites with this same pod image do not have issues with the restarting.
How can I diagnose this further? What resources do I have to examine what is going on in a pod to cause it to restart without errors or warnings?
Thanks for any help
Do you have workload logging enabled?
Yes, I do. I am getting logs from the workload for lots of other things, including web traffic information and other errors that occur (even though they don't cause a restart).
Also, when it restarts, I get all the logging of the start up process.
There are just no errors that anything went wrong.
The logs simply show traffic logging as normal, and then I can see where it restarted because of the startup process logs. Just no logs that would indicate why or what went wrong to cause it to restart.
Greetings @JLloyd,
You mentioned that you're getting an error along the lines of "readiness and liveness probes failed".
Readiness and liveness checks fail when the probe takes longer than the `timeoutSeconds` you have specified. The troubleshooting steps for this error vary depending on the probe type; for example, with an `exec`-type probe it might indicate that the command being executed is taking longer than anticipated to run.
Can you run the following query in your Logs Explorer so we can troubleshoot further:
```
log_id("events")
resource.type="k8s_pod"
resource.labels.cluster_name=*CHANGE TO YOUR CLUSTER NAME*
jsonPayload.message=~"Liveness probe failed"
```
Also, kindly post the result of `kubectl describe pod *POD_NAME*` here.
Hello and thank you for the response.
Thank you for the query. I didn't know how to acquire this.
I ran the above query and see this repeated numerous times:
```
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Here is the output of the pod description:
```
Name:                 alakazam-web-84f7587ff7-ztc4f
Namespace:            default
Priority:             1000000
Priority Class Name:  high-priority
Service Account:      default
Node:                 gke-alakazam-cluster-production-prima-96f0562c-zfp9/10.0.0.1
Start Time:           Thu, 08 Feb 2024 21:42:43 +0000
Labels:               app=web
                      nonPreemptible=true
                      pod-template-hash=84f7587ff7
Annotations:          <none>
Status:               Running
IP:                   ****
IPs:
  IP:  ****
Controlled By:  ReplicaSet/alakazam-web-84f7587ff7
Containers:
  web:
    Container ID:   containerd://****
    Image:          us.gcr.io/kordata-devops/kordata/web:web-master
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 12 Feb 2024 17:58:16 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 12 Feb 2024 17:39:16 +0000
      Finished:     Mon, 12 Feb 2024 17:58:16 +0000
    Ready:          True
    Restart Count:  31
    Limits:
      cpu:     800m
      memory:  750Mi
    Requests:
      cpu:     400m
      memory:  750Mi
    Liveness:   http-get http://:8080/api/health delay=30s timeout=2s period=30s #success=1 #failure=2
    Readiness:  http-get http://:8080/api/health delay=30s timeout=3s period=30s #success=1 #failure=2
    Environment:
      ****
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lgwxl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-lgwxl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             non-preemptible-pool=true:NoSchedule
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  27m (x470 over 3d23h)   kubelet  Readiness probe failed: Get "http://10.24.130.13:8080/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  5m6s (x948 over 3d23h)  kubelet  Liveness probe failed: Get "http://10.24.130.13:8080/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
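One way to check whether `/api/health` intermittently exceeds the 2 s / 3 s probe timeouts shown in the events above is to poll it from another pod with a matching client-side timeout. A sketch (IP and port taken from the events; the loop itself is illustrative, not a fix):

```shell
# Run from a debug pod on the same cluster network.
# -m 2 mirrors the liveness probe's 2 s timeout; any FAILED line marks a
# request the kubelet would also have counted as a probe failure.
for i in $(seq 1 120); do
  ts=$(date -u +%H:%M:%S)
  if out=$(curl -s -o /dev/null -m 2 -w '%{http_code} %{time_total}s' \
        http://10.24.130.13:8080/api/health); then
    echo "$ts OK $out"
  else
    echo "$ts FAILED (timeout or connection error)"
  fi
  sleep 5
done
```

If FAILED lines cluster around the restart times, the endpoint is genuinely stalling under the kubelet's timeout rather than the probe configuration being wrong.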
Hi, I am currently facing this exact same issue. In the logs, I'm seeing the error:
```
message: "Liveness probe failed: Get "http://10.33.19.5:9020/api/health": dial tcp 10.33.19.5:9020: connect: connection refused"
```
Hi, I am also facing this exact issue.
On running that query, I'm getting this in the logs for the pods of 2 particular deployments:
```
message: "Liveness probe failed: Get "http://10.33.19.5:9020/actuator/health": dial tcp 10.33.19.5:9020: connect: connection refused"
```
I've confirmed that this endpoint is actually working by calling it from another pod, and I get the response:
```
/ # curl http://10.33.33.6:9020/actuator/health
{"status":"UP","groups":["liveness","readiness"]}
/ #
```
I have tried to make provision for slow responses by setting `timeoutSeconds` to 30 and `failureThreshold` to 10, but it still ends up restarting.
What might be the issue here?
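For reference, the relaxed settings described above would look something like this in the deployment spec (path and port taken from the log line; the other values are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 9020
  initialDelaySeconds: 30   # give the app time to start before the first probe
  periodSeconds: 10
  timeoutSeconds: 30        # per-request timeout the kubelet applies
  failureThreshold: 10      # consecutive failures before the container is killed
```

Note that the error here is `connection refused` rather than a timeout, which suggests nothing is listening on port 9020 at probe time (e.g. the process is still starting, or has crashed), so a `startupProbe` may be more relevant than longer timeouts.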
Random pod restarts are frustrating when there's no clear indication in the logs or resource metrics. A systematic way to narrow down the root cause:
1. Readiness and liveness probes — compare the probe's `timeoutSeconds` against the real latency of the health endpoint under load; a probe that times out restarts the container even though the application logs no error.
2. Network connectivity issues — intermittent problems between the kubelet and the pod (CNI, node networking) can make a healthy endpoint unreachable.
3. Application-specific issues — long GC pauses, thread-pool exhaustion, or a blocked event loop can stall the health endpoint without showing up in CPU/memory graphs.
4. Kubernetes-specific issues — node pressure, evictions, or preemption on the affected node.
5. Enhanced logging and monitoring — log the latency of the health endpoint itself and alert on slow responses.
Working through these steps should let you gather more information and pinpoint the cause of the random pod restarts.
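As a starting point, the checks above map to a few concrete commands (POD and NAMESPACE are placeholders; run against the affected cluster):

```shell
# Probe history and the recorded restart reason (look for exit code 137):
kubectl describe pod POD -n NAMESPACE
# Logs from the container instance that was killed, not the replacement:
kubectl logs POD -n NAMESPACE --previous
# Recent events for the pod, newest last:
kubectl get events -n NAMESPACE \
  --field-selector involvedObject.name=POD --sort-by=.lastTimestamp
# Time the health endpoint from inside the container itself:
kubectl exec POD -n NAMESPACE -- \
  curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  http://localhost:8080/api/health
```

Comparing the in-container response time against the probe's `timeoutSeconds` is usually the fastest way to tell an application stall apart from a network or node problem.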