Hi,
I deployed a private GKE cluster. External HTTP(S) load balancers via Ingress resources work, including global external static IP addresses, but for external TCP load balancers with a regional external IP address the Service is stuck on "Ensuring load balancer" and no load balancer resources show up in GCP.
gcloud compute addresses describe test
address: 35.203.116.161
addressType: EXTERNAL
creationTimestamp: '2023-02-22T07:09:20.939-08:00'
description: ''
id: '6492500574739336911'
kind: compute#address
name: test
networkTier: PREMIUM
region: https://www.googleapis.com/compute/v1/projects/test-stg/regions/northamerica-northeast1
selfLink: https://www.googleapis.com/compute/v1/projects/test-stg/regions/northamerica-northeast1/addresses/test
status: RESERVED
kubectl -n default describe service store-v1-lb-svc
Name: store-v1-lb-svc
Namespace: default
Labels: <none>
Annotations: cloud.google.com/l4-rbs: enabled
Selector: app=store
Type: LoadBalancer
IP Family Policy: SingleStack
IP Families: IPv4
IP: 172.16.51.150
IPs: 172.16.51.150
IP: 35.203.116.161
Port: tcp-port 8080/TCP
TargetPort: 8080/TCP
NodePort: tcp-port 30288/TCP
Endpoints: 172.16.0.8:8080,172.16.5.3:8080
Session Affinity: None
External Traffic Policy: Cluster
Events:
  Type    Reason                Age  From                Message
  ----    ------                ---  ----                -------
  Normal  EnsuringLoadBalancer  41m  service-controller  Ensuring load balancer
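For reference, a manifest that would produce roughly the describe output above looks like this (reconstructed from that output; pointing spec.loadBalancerIP at the reserved address is an assumption about how the IP was requested):
apiVersion: v1
kind: Service
metadata:
  name: store-v1-lb-svc
  namespace: default
  annotations:
    cloud.google.com/l4-rbs: "enabled"   # ask GKE for a backend-service-based external L4 load balancer
spec:
  type: LoadBalancer
  loadBalancerIP: 35.203.116.161   # the reserved regional address shown above (assumption)
  externalTrafficPolicy: Cluster
  selector:
    app: store
  ports:
  - name: tcp-port
    protocol: TCP
    port: 8080
    targetPort: 8080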
The Service exists in the GKE UI, but load balancer "a1b7fcace331d4eab99b37465eb5bbb1" does not exist.
I can't find anything in the GKE logs, and I don't know how TCP load balancers are created on the GCP side.
Any help would be greatly appreciated!
Thanks
Hi @Bmtl,
Based on the message "Ensuring load balancer", the external IP of the load balancer is stuck as Pending.
Here's what you can do: test with a regional external IPv4 address in the same region as the cluster. This is discussed in the About LoadBalancer Service parameters document. Allocate a new regional IP and update the Service's loadBalancerIP to it; the load balancer will then attach to the new address.
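A quick sketch of that suggestion, using a hypothetical address name test-v2 and the region from the address output above (adjust names to your project):
# reserve a new regional external IP
gcloud compute addresses create test-v2 --region=northamerica-northeast1 --network-tier=PREMIUM
# print the address that was allocated
gcloud compute addresses describe test-v2 --region=northamerica-northeast1 --format='value(address)'
# point the Service at it (replace NEW_IP with the address printed above)
kubectl -n default patch service store-v1-lb-svc -p '{"spec":{"loadBalancerIP":"NEW_IP"}}'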
Hi Marvin,
Thanks a lot for the answer; sadly, this doesn't work. I've recreated the IP multiple times in the same region without success.
What do you mean by "external IP of the Load Balancer is stuck as Pending"? The regional IP address is marked as RESERVED and assigned to None.
Also, can we view the logs of the GKE component that creates the GCP TCP load balancer?
It seems that the issue is linked to the previous load balancer not being deleted: the controller is stuck trying to delete a load balancer that doesn't exist.
I have tried deleting it using the gcloud CLI, but the resource does not exist.
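In case it helps anyone checking the same thing, one way to confirm whether any L4 resources with that name are left behind in the project (the name here is the one from the "does not exist" message above):
gcloud compute forwarding-rules list --filter="name~a1b7fcace331d4eab99b37465eb5bbb1"
gcloud compute backend-services list --filter="name~a1b7fcace331d4eab99b37465eb5bbb1"
gcloud compute target-pools list --filter="name~a1b7fcace331d4eab99b37465eb5bbb1"
gcloud compute health-checks list --filter="name~a1b7fcace331d4eab99b37465eb5bbb1"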
The finalizer is probably stuck in a bad state from a multi-year-old bug related to race conditions; you'd need to kubectl patch the Service to fix the state and then delete manually.
https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/
^-- Finalizers are responsible for garbage collection / automatic deletion of auto-provisioned resources. Sometimes they get stuck. As long as you verify the cloud resource got deleted, or delete it manually to prevent orphaned cloud resources, it should be safe to remove a finalizer and delete the object by hand.
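For example, to see which finalizers are currently set on the stuck Service from this thread (exact names vary by GKE version; service.kubernetes.io/load-balancer-cleanup is the upstream cloud-provider one):
kubectl -n default get service store-v1-lb-svc -o jsonpath='{.metadata.finalizers}'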
https://github.com/kubernetes/kubernetes/issues/39420#issuecomment-546781470
^-- Makes it seem like there are a few race-condition-type bugs that can leave a Service stuck in deleting or put the service controller in a confused state (for example a delete and an update happening at the same time, or a delete that can't succeed because the resource is already gone), so the finalizer never gets updated/removed due to the inconsistent state.
https://www.middlewareinventory.com/blog/kubectl-delete-stuck-what-to-do/
^-- really useful read; it basically says:
1. Try waiting longer.
2. Enable kube-controller-manager logs (to see the service controller's logs); they might tell you why it's stuck, for example it can't delete because the resource is already deleted, or an update and a delete happened at the same time and the two operations block each other. On GKE this means enabling control plane logs; see the sketch at the end of this post.
3. If the logs don't help, or you've confirmed the GCP load balancer has been deleted, you should be safe to patch the finalizers to an empty value and delete manually:
kubectl patch (pod|job|ingress|pvc) <name-of-resource> -p '{"metadata":{"finalizers":[]}}' --type=merge
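Applied to the Service in this thread (namespace and name taken from the describe output above):
# drop the stuck finalizers
kubectl -n default patch service store-v1-lb-svc -p '{"metadata":{"finalizers":[]}}' --type=merge
# then delete (and recreate) the Service once you've confirmed no orphaned GCP resources remain
kubectl -n default delete service store-v1-lb-svc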
https://cloud.google.com/kubernetes-engine/docs/troubleshooting#namespace_stuck_in_terminating_state
^-- Even GKE's docs suggest patching the finalizer to a null value as a troubleshooting step when a resource is stuck in an inconsistent state (though this is in the context of a namespace).
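For step 2 above: on GKE the kube-controller-manager runs on the managed control plane, so its logs come from control plane logging rather than from the nodes. A sketch, assuming a cluster named my-cluster in the region from this thread (note that --logging replaces the current set of enabled components, so include whatever you already have; the Logging filter is my best guess at the control-plane-component resource type):
# enable control plane logs, including the controller manager
gcloud container clusters update my-cluster --region=northamerica-northeast1 --logging=SYSTEM,WORKLOAD,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
# then read them back from Cloud Logging
gcloud logging read 'resource.type="k8s_control_plane_component"' --limit=50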
Hi! We're running into the exact problem described by @Bmtl, and we're wondering if any progress was ever made on this?
It's very inconsistent and hard to debug, so if there are any logs I can enable or similar, that would be a great help.
I've created an issue for this at GKE: https://issuetracker.google.com/issues/318528552
TIL: when we are not able to delete _things_ via kubectl, patch them.
Echoing the solution from above:
kubectl patch (pod|job|ingress|pvc) <name-of-resource> -p '{"metadata":{"finalizers":[]}}' --type=merge