Hello all,
I am having an issue with a GKE pod and I'm trying to diagnose it.
The pod in question serves as the main entry point for our websites.
The issue is that the pod on one of our sites keeps randomly restarting. Sometimes it'll restart once a day and sometimes several times in a single hour. There doesn't seem to be any rhyme or reason to it.
Each time it does, it brings our website offline for 10 - 20 seconds while a new one starts up.
There's nothing in the logs, and when I run `kubectl get events`, all it says is that the readiness and liveness probes failed. As such, I don't know how to find out what is going on.
When I look at the CPU/Memory of the pod, there are often little to no changes at the times of restart, and there's plenty of resources.
Also, as stated above, this is only on one of our sites. Other sites with this same pod image do not have issues with the restarting.
How can I diagnose this further? What resources do I have to examine what is going on in a pod to cause it to restart without errors or warnings?
Thanks for any help
Do you have workload logging enabled?
Yes, I do. I am getting logs from the workload for lots of other things, including web traffic information and other errors that occur (even though they don't cause a restart).
Also, when it restarts, I get all the logging of the start up process.
There are just no errors that anything went wrong.
The logs simply show traffic logging as normal, and then I can see where it restarted because of the startup process logs. Just no logs that would indicate why or what went wrong to cause it to restart.
Greetings @JLloyd,
You mentioned that you're getting an error along the lines of "readiness and liveness probes failed".
Readiness and liveness checks fail when the probe takes longer than the `timeoutSeconds` you have specified. The troubleshooting steps for this error vary depending on the probe type; for example, with an `exec`-type probe it might indicate that the command being executed is taking longer than anticipated to run.
Can you run the following query in your Logs Explorer so we can troubleshoot further:
```
log_id("events")
resource.type="k8s_pod"
resource.labels.cluster_name=*CHANGE TO YOUR CLUSTER NAME*
jsonPayload.message=~"Liveness probe failed"
```
Also, kindly post the result of `kubectl describe pod *POD_NAME*` here.
Hello and thank you for the response.
Thank you for the query. I didn't know how to acquire this.
I ran the above query and see this repeated numerous times:
```
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Here is the output of the pod description:
```
Name:                 alakazam-web-84f7587ff7-ztc4f
Namespace:            default
Priority:             1000000
Priority Class Name:  high-priority
Service Account:      default
Node:                 gke-alakazam-cluster-production-prima-96f0562c-zfp9/10.0.0.1
Start Time:           Thu, 08 Feb 2024 21:42:43 +0000
Labels:               app=web
                      nonPreemptible=true
                      pod-template-hash=84f7587ff7
Annotations:          <none>
Status:               Running
IP:                   ****
IPs:
  IP:  ****
Controlled By:  ReplicaSet/alakazam-web-84f7587ff7
Containers:
  web:
    Container ID:   containerd://****
    Image:          us.gcr.io/kordata-devops/kordata/web:web-master
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 12 Feb 2024 17:58:16 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 12 Feb 2024 17:39:16 +0000
      Finished:     Mon, 12 Feb 2024 17:58:16 +0000
    Ready:          True
    Restart Count:  31
    Limits:
      cpu:     800m
      memory:  750Mi
    Requests:
      cpu:     400m
      memory:  750Mi
    Liveness:   http-get http://:8080/api/health delay=30s timeout=2s period=30s #success=1 #failure=2
    Readiness:  http-get http://:8080/api/health delay=30s timeout=3s period=30s #success=1 #failure=2
    Environment:
      ****
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lgwxl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-lgwxl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             non-preemptible-pool=true:NoSchedule
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  27m (x470 over 3d23h)   kubelet  Readiness probe failed: Get "http://10.24.130.13:8080/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  5m6s (x948 over 3d23h)  kubelet  Liveness probe failed: Get "http://10.24.130.13:8080/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
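One way to check whether `/api/health` intermittently exceeds the 2 s / 3 s probe timeouts shown in the events above is to poll it from another pod with a matching client-side timeout. A sketch (IP and port taken from the events; the loop itself is illustrative, not a fix):

```shell
# Run from a debug pod on the same cluster network.
# -m 2 mirrors the liveness probe's 2 s timeout; any FAILED line marks a
# request the kubelet would also have counted as a probe failure.
for i in $(seq 1 120); do
  ts=$(date -u +%H:%M:%S)
  if out=$(curl -s -o /dev/null -m 2 -w '%{http_code} %{time_total}s' \
        http://10.24.130.13:8080/api/health); then
    echo "$ts OK $out"
  else
    echo "$ts FAILED (timeout or connection error)"
  fi
  sleep 5
done
```

If FAILED lines cluster around the restart times, the endpoint is genuinely stalling under the kubelet's timeout rather than the probe configuration being wrong.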
Hi, I am currently facing this exact same issue. In the logs, I'm seeing the error:
```
message: "Liveness probe failed: Get "http://10.33.19.5:9020/api/health": dial tcp 10.33.19.5:9020: connect: connection refused"
```
Hi, I am also facing this exact issue.
On running that query, I'm getting this in the logs for the pods of 2 particular deployments:
```
message: "Liveness probe failed: Get "http://10.33.19.5:9020/actuator/health": dial tcp 10.33.19.5:9020: connect: connection refused"
```
I've confirmed that this endpoint is actually working by calling it from another pod, and I get the response:
```
/ # curl http://10.33.33.6:9020/actuator/health
{"status":"UP","groups":["liveness","readiness"]}
/ #
```
I have tried to make provision for slow responses by setting `timeoutSeconds` to 30 and `failureThreshold` to 10, but it still ends up restarting.
What might be the issue here?
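For reference, the relaxed settings described above would look something like this in the deployment spec (path and port taken from the log line; the other values are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 9020
  initialDelaySeconds: 30   # give the app time to start before the first probe
  periodSeconds: 10
  timeoutSeconds: 30        # per-request timeout the kubelet applies
  failureThreshold: 10      # consecutive failures before the container is killed
```

Note that the error here is `connection refused` rather than a timeout, which suggests nothing is listening on port 9020 at probe time (e.g. the process is still starting, or has crashed), so a `startupProbe` may be more relevant than longer timeouts.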
Random pod restarts are frustrating when there's no clear indication in the logs or resource metrics. A systematic way to narrow down the root cause:
1. Readiness and liveness probes — compare the probe's `timeoutSeconds` against the real latency of the health endpoint under load; a probe that times out restarts the container even though the application logs no error.
2. Network connectivity issues — intermittent problems between the kubelet and the pod (CNI, node networking) can make a healthy endpoint unreachable.
3. Application-specific issues — long GC pauses, thread-pool exhaustion, or a blocked event loop can stall the health endpoint without showing up in CPU/memory graphs.
4. Kubernetes-specific issues — node pressure, evictions, or preemption on the affected node.
5. Enhanced logging and monitoring — log the latency of the health endpoint itself and alert on slow responses.
Working through these steps should let you gather more information and pinpoint the cause of the random pod restarts.
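As a starting point, the checks above map to a few concrete commands (POD and NAMESPACE are placeholders; run against the affected cluster):

```shell
# Probe history and the recorded restart reason (look for exit code 137):
kubectl describe pod POD -n NAMESPACE
# Logs from the container instance that was killed, not the replacement:
kubectl logs POD -n NAMESPACE --previous
# Recent events for the pod, newest last:
kubectl get events -n NAMESPACE \
  --field-selector involvedObject.name=POD --sort-by=.lastTimestamp
# Time the health endpoint from inside the container itself:
kubectl exec POD -n NAMESPACE -- \
  curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  http://localhost:8080/api/health
```

Comparing the in-container response time against the probe's `timeoutSeconds` is usually the fastest way to tell an application stall apart from a network or node problem.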