Hi, around the time of the GCP outage, at 2025-06-12 13:13:40 EDT, I started having issues with multiple load balancer rules. When accessing a URL that resolves to a load balancer created with Gateway API resources and points to specific backends, I only get an unconditional drop overload error as a response.
The backends are created by Gateway API HTTPRoutes and point to ClusterIP Kubernetes services backed by a healthy Kubernetes deployment. I know the deployment is healthy because it has enough resources to operate and its pod can be reached and responds to in-cluster requests. I also know that requests made to the load balancer IPs actually reach the load balancer, because I got these messages in the load balancer logs when querying the URLs that route to the faulty backends:
jsonPayload: {
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  backendTargetProjectNumber: "projects/xxxxxxxxxxxx"
  cacheDecision: [2]
  remoteIp: "x.x.x.x"
  statusDetails: "failed_to_connect_to_backend"
}
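For reference, this is roughly how those entries can be pulled from Cloud Logging; the resource type, filter and time window below are just an example, adjust them to your own setup:

# Example only: query recent load balancer log entries with the failing status
gcloud logging read \
  'resource.type="http_load_balancer" AND jsonPayload.statusDetails="failed_to_connect_to_backend"' \
  --freshness=1h --limit=20 --format=json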
After looking at the availability of the load balancer backend services, I noticed none of the backends were actually available (0 of 0 in every backend). I was confused, though, because there's no documentation on Google's side for unconditional drop overload; after further inspection it turns out to be an error related to Envoy Proxy, probably originating on the GCP side.
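For completeness, the same backend availability can also be checked from gcloud; the backend service name below is a placeholder, since the Gateway controller auto-generates the real names (use --global for a global external load balancer):

# Example only: list backend services and check the health of the suspicious one
gcloud compute backend-services list
gcloud compute backend-services get-health <backend-service-name> --global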
Hi, after many hours of debugging I found a solution to the issue. I'm not really sure which combination of actions ultimately fixed it, but here's what I did:
I recreated all of the Gateway HTTPRoutes associated with every service that was having trouble connecting to the load balancer; it did nothing. Then I deleted the service and the pod, which didn't work either, but I noticed that the ServiceNetworkEndpointGroup wouldn't get deleted even after the service was gone.
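For anyone following along, these GKE-managed NEG objects can be listed and inspected with plain kubectl; the namespace and resource names here are placeholders:

# Example only: find and inspect the ServiceNetworkEndpointGroup objects in the namespace
kubectl get servicenetworkendpointgroups -n xxxx
kubectl describe servicenetworkendpointgroup <neg-name> -n xxxx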
I went ahead and inspected the resource, and this is what I got:
Name:         k8s1-xxxxxx-nginx-redirect-service-xxxx
Namespace:    xxxx
Labels:       networking.gke.io/managed-by=neg-controller
              networking.gke.io/service-name=nginx-redirect-service-xxx
              networking.gke.io/service-port=80
Annotations:  <none>
API Version:  networking.gke.io/v1beta1
Kind:         ServiceNetworkEndpointGroup
Metadata:
  Creation Timestamp:             2024-11-13T12:13:06Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-06-13T01:28:00Z
  Finalizers:
    networking.gke.io/neg-finalizer
  Generation:  165
  Owner References:
    API Version:           v1
    Block Owner Deletion:  false
    Controller:            true
    Kind:                  Service
    Name:                  nginx-redirect-service-xxxx
    UID:                   xxxx-xxxx-xxxx-xxxx-xxxx
  Resource Version:  9701xxxxx
  UID:               xxxx-xxxx-xxxx-xxxx-xxxx
Spec:
Status:
  Conditions:
    Last Transition Time:  2025-06-12T18:46:20Z
    Message:               googleapi: Error 503: Policy checks are unavailable., backendError
    Reason:                NegInitializationFailed
    Status:                False
    Type:                  Initialized
    Last Transition Time:  2025-06-12T18:14:03Z
    Message:               failed to get NEG for service: googleapi: Error 503: Policy checks are unavailable., backendError
    Reason:                NegSyncFailed
    Status:                False
    Type:                  Synced
  Last Sync Time:  2025-06-12T19:12:30Z
  Network Endpoint Groups:
    Id:                     xxxx
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-b/networkEndpointGroups/k8s1-xxxxxx-nginx-redirectwservice-xxx
    Id:                     yyyy
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-c/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
    Id:                     zzzz
    Network Endpoint Type:  GCE_VM_IP_PORT
    Self Link:              https://www.googleapis.com/compute/beta/projects/xxxxx/zones/us-central1-f/networkEndpointGroups/k8s1-xxxxxx-nginx-redirect-service-xxx
The error "failed to get NEG for service: googleapi: Error 503: Policy checks are unavailable." was weird, since this resource is managed automatically by the NEG controller.
I proceeded to delete the ServiceNetworkEndpointGroup myself:
kubectl delete servicenetworkendpointgroup k8s1-xxxxxx-nginx-redirect-service-xxx -n xxx
Then I noticed it had finalizers and the deletion was hanging forever, so I patched the finalizers out:
kubectl patch servicenetworkendpointgroup k8s1-xxxxxx-nginx-redirect-service-xxx --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]' -n xxxx
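Keep in mind that removing the finalizer by hand only lets Kubernetes delete the object; the NEG controller no longer gets a chance to clean up the underlying Compute Engine NEGs, so they may be left behind. If that happens, they can be listed and removed manually; the filter, name and zone below are placeholders:

# Example only: find and delete orphaned NEGs left behind after bypassing the finalizer
gcloud compute network-endpoint-groups list --filter="name~k8s1-"
gcloud compute network-endpoint-groups delete <neg-name> --zone=<zone>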
That alone didn't fix the issue, so I deleted and recreated the service and did the same for the deployment. After a few seconds to minutes, I could access the URLs routing to those backends with no unconditional drop overload errors anymore.
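If you try the same thing, you can watch the recreated NEG object to confirm it syncs correctly before testing the URLs again; names below are placeholders:

# Example only: wait for the recreated NEG object to report Initialized/Synced = True
kubectl get servicenetworkendpointgroups -n xxxx -w
# Then confirm the load balancer sees healthy endpoints again
gcloud compute backend-services get-health <backend-service-name> --global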
I hope this is helpful to someone who's having the same issue.