
LoadBalancerNegNotReady needs to time out before pod starts up regularly

I have a relatively simple cluster setup:

  • An ingress to handle certificates and to expose the services to the public internet
  • The ingress is linked to two NodePort services, each of which exposes a deployment via HTTP2
  • Two deployments, each consisting of a single pod that serves a gRPC service.
  • The deployments share multiple mounts of a shared standard-rwx volume (roughly sketched below).
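
Roughly, the wiring looks like the sketch below. The names, port, and mount path are placeholders and this is an approximation of my manifests rather than the exact ones; only the NEG and app-protocols annotations are the standard GKE ones:

  apiVersion: v1
  kind: Service
  metadata:
    name: canary-service                                        # placeholder
    annotations:
      cloud.google.com/neg: '{"ingress": true}'                 # container-native NEG backend for the ingress
      cloud.google.com/app-protocols: '{"grpc-port": "HTTP2"}'  # expose the backend via HTTP2
  spec:
    type: NodePort
    selector:
      app: canary
    ports:
      - name: grpc-port
        port: 5000
        targetPort: 5000
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: canary-deployment                                     # placeholder
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: canary
    template:
      metadata:
        labels:
          app: canary
      spec:
        containers:
          - name: canary-engine
            image: IMAGE_TAG:1.0.0
            ports:
              - containerPort: 5000
            volumeMounts:
              - name: shared-data
                mountPath: /data                                # placeholder
        volumes:
          - name: shared-data
            persistentVolumeClaim:
              claimName: shared-rwx-pvc                         # standard-rwx (Filestore) PVC, placeholder name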

In most situations, the setup works as expected. However, startup takes unexpectedly long when restarting the deployments (rollout restart) or when updating the pod images.

Example with rollout restart:
When I run describe on the newly created pod immediately after executing the restart command, I get the following:

Events:
  Type    Reason                   Age  From                                   Message
  ----    ------                   ---  ----                                   -------
  Normal  LoadBalancerNegNotReady  36s  neg-readiness-reflector                Unable to determine the pod's node name.
  Normal  Scheduled                36s  gke.io/optimize-utilization-scheduler  Successfully assigned default/canary-deployment-658d966c8d-jvszn to gk3-nsp-cluster-pool-1-19cf81c9-shll
  Normal  LoadBalancerNegNotReady  36s  neg-readiness-reflector                Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-84ef8c22-default-canary-service-5000-61b01fd8]

The pods then stay in the ContainerCreating state for 15 minutes. After 15 minutes, a timeout apparently expires and the startup continues. This is the output of the describe command 26 minutes later:

Events:
  Type    Reason                   Age  From                                   Message
  ----    ------                   ---  ----                                   -------
  Normal  LoadBalancerNegNotReady  26m  neg-readiness-reflector                Unable to determine the pod's node name.
  Normal  Scheduled                26m  gke.io/optimize-utilization-scheduler  Successfully assigned default/canary-deployment-658d966c8d-jvszn to gk3-nsp-cluster-pool-1-19cf81c9-shll
  Normal  LoadBalancerNegNotReady  26m  neg-readiness-reflector                Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-84ef8c22-default-canary-service-5000-61b01fd8]
  Normal  Pulled                   11m  kubelet                                Container image "IMAGE_TAG:1.0.0" already present on machine
  Normal  Created                  11m  kubelet                                Created container canary-engine
  Normal  Started                  11m  kubelet                                Started container canary-engine
  Normal  LoadBalancerNegTimeout   11m  neg-readiness-reflector                Timeout waiting for pod to become healthy in at least one of the NEG(s): [k8s1-84ef8c22-default-canary-service-5000-61b01fd8]. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

The service running in the container only starts logging 15 minutes after the restart command, so I have the impression that it takes the full 15 minutes to provision the container. Why do the checks fail/time out? How can I diagnose this issue?

1 ACCEPTED SOLUTION

Hi kensan,

It turned out that setting the securityContext delayed the deployment startup.

  securityContext:
    fsGroup: 1000
This was a "remnant" from an earlier development stage of the service, which apparently started to cause issues once we migrated to rwx/Filestore volumes. Unfortunately, there was no indication in the logs that this step takes unusually long. The containers are now starting up within a couple of seconds.
 
Thank you for your help!


6 REPLIES

Hi @piotrfue,

Welcome to Google Cloud Community!

You are receiving the error "neg-readiness-reflector Timeout waiting for pod to become healthy in at least one of the NEG(s): [k8s1-84ef8c22-default-canary-service-5000-61b01fd8]. Marking condition "cloud.google.com/load-balancer-neg-ready" to True." This means that the deployment's health checks are failing; possible causes are a bad container image or a misconfigured health check.

If you're using kubectl 1.13 or higher, you can check the status of a Pod's readiness gates with the following command:

kubectl get pod POD_NAME -o wide

Check the READINESS GATES column.

This column doesn't exist in kubectl 1.12 and lower. A Pod that is marked as being in the READY state may have a failed readiness gate. To verify this, use the following command:

kubectl get pod POD_NAME -o yaml
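
In the YAML output, the readiness gate appears under spec and the matching condition under status. For the GKE NEG readiness gate it looks roughly like this excerpt (the condition type is the same one shown in your events):

  spec:
    readinessGates:
      - conditionType: "cloud.google.com/load-balancer-neg-ready"
  status:
    conditions:
      - type: "cloud.google.com/load-balancer-neg-ready"
        status: "False"   # stays False until the NEG reports the endpoint healthy, or the timeout marks it True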

Load balancer health checks are specified per backend service. Each backend service corresponds to a Kubernetes Service, and each backend service must reference a Google Cloud health check. This health check is different from a Kubernetes liveness or readiness probe because it is implemented outside of the cluster. You can check the container-native load balancing troubleshooting documentation for more details.
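
For gRPC/HTTP2 backends in particular, the default health check often needs to be adjusted through a BackendConfig attached to the Service. A minimal sketch, assuming a /healthz endpoint on the serving port (the name, path, and values are illustrative, not taken from your manifests):

  apiVersion: cloud.google.com/v1
  kind: BackendConfig
  metadata:
    name: canary-backendconfig     # placeholder
  spec:
    healthCheck:
      type: HTTP2                  # match the backend's app protocol
      requestPath: /healthz        # placeholder; must return HTTP 200 on the serving port
      port: 5000
      checkIntervalSec: 15
      timeoutSec: 5

The BackendConfig is attached by annotating the Service, e.g. cloud.google.com/backend-config: '{"default": "canary-backendconfig"}'.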

If the issue is not resolved, it is recommended to contact Google Cloud Support.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

 

Thank you for the links - I read them and I think my backend/health check configuration is fine. I double-checked that my current configuration works by applying the same manifests on a new cluster, where I did not observe the same behavior.

In a second test, I removed the ingress, and consequently also the health checks, to verify that the health checks are truly the bottleneck. It turned out that the pods stay in the "Scheduled" state for the full 15 minutes, and in some situations even longer. From that I conclude that the issue is not the health checks. Instead, the pods do not get the resources required to start up within the 15-minute health check window and eventually get marked as healthy regardless of their actual state.

The events below show an example of a pod startup without the ingress/health checks. As the startup takes more than 15 minutes, I would expect the same NEG health check timeout for this container too if it were connected to an ingress:

Events:
  Type    Reason     Age   From                                   Message
  ----    ------     ----  ----                                   -------
  Normal  Scheduled  17m   gke.io/optimize-utilization-scheduler  Successfully assigned default/canary-deployment-5c8bc58857-mnrd9 to gk3-nsp-cluster-pool-1-94988cc2-y5kv
  Normal  Pulling    36s   kubelet                                Pulling image "europe-west3-docker.pkg.dev/***/***/***:1.0.1"
  Normal  Pulled     33s   kubelet                                Successfully pulled image "europe-west3-docker.pkg.dev/***/***:1.0.1" in 2.389s (2.389s including waiting). Image size: 122619667 bytes.
  Normal  Created    33s   kubelet                                Created container canary-engine
  Normal  Started    33s   kubelet                                Started container canary-engine

I am aware that this changes the context of the question - but do you know how I can reduce the time that the pod is in the scheduled state?

 

Hi @piotrfue,

If you are using a Standard cluster, you can use image streaming so that your workloads start without waiting for the full image to download, which can noticeably reduce initialization times. Autopilot clusters must run GKE version 1.25.5-gke.1000 or later to have image streaming enabled automatically.
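
On a Standard cluster, image streaming can be enabled on an existing cluster with a command along these lines (CLUSTER_NAME is a placeholder; the images must be hosted in Artifact Registry):

  gcloud container clusters update CLUSTER_NAME --enable-image-streaming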

Hi kensan,

It turned out that setting the securityContext delayed the deployment startup.

  securityContext:
    fsGroup: 1000
This was a "remnant" from an earlier development stage of the service, which apparently started to cause issues once we migrated to rwx/Filestore volumes. Unfortunately, there was no indication in the logs that this step takes unusually long. The containers are now starting up within a couple of seconds.
 
Thank you for your help!

We have the same situation in production. Is it okay to remove securityContext: fsGroup in a prod environment? We are also clueless about exactly which files are causing the problem; there is nothing concrete in the GKE logs or events.

In our setup it was the number of files, which increased over time. I assume that the time required for applying the file permissions (the recursive ownership change triggered by fsGroup) increased with the number of affected files.

I do not know whether it is ok to remove the security context in your application. I would assume this mainly depends on your setup.
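
If removing it entirely feels risky, one alternative that might be worth testing (assuming the Filestore CSI driver honors it) is to keep fsGroup but add fsGroupChangePolicy: "OnRootMismatch", which skips the recursive permission change when the root of the volume already has the expected ownership. A minimal sketch, with the group id only as an example:

  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"   # skip the recursive chown/chmod when the volume root already matches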
