Hi, around the time of the GCP outage, at 2025-06-12 13:13:40 EDT, I started having issues with multiple load balancer rules. When accessing a URL that resolves to a load balancer created with Gateway API resources and points to specific backends, I only get an unconditional drop overload error as a response.
The backends are created by Gateway API HTTPRoutes and point to ClusterIP Kubernetes services backed by a healthy Kubernetes deployment. I know the deployment is healthy because it has enough resources to operate and its pod can be reached and responds to in-cluster requests. I also know that requests made to the load balancer IPs actually reach the load balancer, because I got these messages in the load balancer logs when querying the URLs that route to the faulty backends:
jsonPayload: {
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  backendTargetProjectNumber: "projects/xxxxxxxxxxxx"
  cacheDecision: [2]
  remoteIp: "x.x.x.x"
  statusDetails: "failed_to_connect_to_backend"
}
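For reference, this is roughly how those entries can be pulled from Cloud Logging; the resource type, filter and time window below are just an example, adjust them to your own setup:

# Example only: query recent load balancer log entries with the failing status
gcloud logging read \
  'resource.type="http_load_balancer" AND jsonPayload.statusDetails="failed_to_connect_to_backend"' \
  --freshness=1h --limit=20 --format=json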
After looking at the availability of the load balancer backend services, I noticed none of the backends were actually available (0 of 0 in every backend). I was confused, though, because there's no documentation on Google's side for unconditional drop overload; after further inspection it turns out to be an error related to Envoy Proxy, probably originating on the GCP side.
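For completeness, the same backend availability can also be checked from gcloud; the backend service name below is a placeholder, since the Gateway controller auto-generates the real names (use --global for a global external load balancer):

# Example only: list backend services and check the health of the suspicious one
gcloud compute backend-services list
gcloud compute backend-services get-health <backend-service-name> --global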
Hi, after many hours of debugging I found a solution to the issue. I'm not really sure which combination of actions ultimately fixed it, but here's what I did:
I recreated all of the Gateway HTTPRoutes associated with every service that was having trouble connecting to the load balancer; it did nothing. Then I deleted the service and the pod, which didn't work either, but I noticed that the ServiceNetworkEndpointGroup wouldn't get deleted even after the service was gone.
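For anyone following along, these GKE-managed NEG objects can be listed and inspected with plain kubectl; the namespace and resource names here are placeholders:

# Example only: find and inspect the ServiceNetworkEndpointGroup objects in the namespace
kubectl get servicenetworkendpointgroups -n xxxx
kubectl describe servicenetworkendpointgroup <neg-name> -n xxxx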
I went ahead and inspected the resource, and this is what I got:
Name:         k8s1-xxxxxx-nginx-redirect-service-xxxx
Namespace:    xxxx
Labels:       networking.gke.io/managed-by=neg-controller
              networking.gke.io/service-name=nginx-redirect-service-xxx
              networking.gke.io/service-port=80
Annotations:  <none>
API Version:  networking.gke.io/v1beta1
Kind:         ServiceNetworkEndpointGroup
Metadata:
  Creation Timestamp:             2024-11-13T12:13:06Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-06-13T01:28:00Z
  Finalizers:
    networking.gke.io/neg-finalizer
  Generation:  165
  Owner References:
    API Version:           v1
    Block Owner Deletion:  false
    Controller:            true
    Kind:                  Service
    Name:                  nginx-redirect-service-xxxx
    UID:                   xxxx-xxxx-xxxx-xxxx-xxxx
  Resource Version:  9701xxxxx
  UID:               xxxx-xxxx-xxxx-xxxx-xxxx
Spec:
Status:
  Conditions:
    Last Transition Time:  2025-06-12T18:46:20Z
    Message:               googleapi: Error 503: Policy checks are unavailable., backendError
    Reason:                NegInitializationFailed
    Status:                False
    Type:                  Initialized
    Last Transition Time:  2025-06-12T18:14:03Z
    Message:               failed to get NEG for service: googleapi: Error 503: Policy checks are unavailable., backendError
    Reason:                NegSyncFailed
    Status:                False
    Type:                  Synced
  Last Sync Time:  2025-06-12T19:12:30Z
  Network Endpoint Groups:
    Id:                     xxxx
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-b/networkEndpointGroups/k8s1-xxxxxx-nginx-redirectwservice-xxx
    Id:                     yyyy
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-c/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
    Id:                     zzzz
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-f/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
The error "failed to get NEG for service: googleapi: Error 503: Policy checks are unavailable." was weird, since this resource is managed automatically by the NEG controller.
I proceeded to delete the ServiceNetworkEndpointGroup myself:
kubectl delete servicenetworkendpointgroup k8s1-xxxxxx-nginx-redirect-service-xxx -n xxx
Then I noticed it had finalizers and the deletion was hanging forever, so I patched the finalizers out:
kubectl patch servicenetworkendpointgroup k8s1-xxxxxx-nginx-redirect-service-xxx --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]' -n xxxx
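Keep in mind that removing the finalizer by hand only lets Kubernetes delete the object; the NEG controller no longer gets a chance to clean up the underlying Compute Engine NEGs, so they may be left behind. If that happens, they can be listed and removed manually; the filter, name and zone below are placeholders:

# Example only: find and delete orphaned NEGs left behind after bypassing the finalizer
gcloud compute network-endpoint-groups list --filter="name~k8s1-"
gcloud compute network-endpoint-groups delete <neg-name> --zone=<zone>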
That alone didn't fix the issue, so I deleted and recreated the service and did the same for the deployment. After a few seconds to minutes, I could access the URLs routing to those backends with no unconditional drop overload errors anymore.
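If you try the same thing, you can watch the recreated NEG object to confirm it syncs correctly before testing the URLs again; names below are placeholders:

# Example only: wait for the recreated NEG object to report Initialized/Synced = True
kubectl get servicenetworkendpointgroups -n xxxx -w
# Then confirm the load balancer sees healthy endpoints again
gcloud compute backend-services get-health <backend-service-name> --global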
I hope this is helpful to someone who's having the same issue.