
scale down blocked by pod

I deployed 300 pods using the YAML below:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ubuntu-deployment
  namespace: test
  labels:
    app: ubuntu
spec:
  replicas: 300
  selector:
    matchLabels:
      app: ubuntu
  template:
    metadata:
      labels:
        app: ubuntu
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - name: ubuntu
        image: ubuntu:latest
        command: ["/bin/sh"]
        args: ["-c", "while true; do sleep 1;done"]
        ports:
        - containerPort: 80

When I scaled down by reducing the replica count, pods were terminating immediately, but for one node I got this error: "Pod is blocking scale down because its controller can't be found".

"noScaleDown": {
"nodes": [
{
"node": {
"cpuRatio": 2,
"name": "gke-cluster-test-default-pool-b4e5d6f6-jf9h",
"mig": {
"name": "gke-cluster-test-default-pool-b4e5d6f6-grp",
"nodepool": "default-pool",
"zone": "us-central1-c"
},
"memRatio": 2
},
"reason": {
"messageId": "no.scale.down.node.pod.controller.not.found",
"parameters": [
"ubuntu-deployment-69646b5cdc-2kcx5"
]
}
}
],
"nodesTotalCount": 1
}
}
The pod was deployed with kind: Deployment (a controller), yet at the time of scale-down the Deployment controller was apparently deleted/removed without removing its pods. I am facing a similar issue with a Kubernetes Job (controller) as well.
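
For reference, the scale-down itself was just a replica-count change along these lines (the target count here is only an example):

kubectl scale deployment/ubuntu-deployment -n test --replicas=50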

Is there any solution to fix this issue other than manually deleting the pod?


Hello @VivekGadhia,

Thank you for contacting Google Cloud Community.

I understand that you have an issue with the cluster scale-down process: it seems to be blocked because the controller for one of the Pods can't be found. Please correct me if I have misunderstood or missed anything.

There could be multiple reasons for this issue; NoScaleDown events are best-effort, and one possible cause is a misconfiguration in the YAML files. The autoscaler does not consider a node a candidate for scale-down if it hosts pods without a pod controller, because without a controller there is no guarantee that the pod can be safely evicted. This is reported as the no.scale.down.node.pod.controller.not.found event.
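
As a quick sketch (assuming you have kubectl access to the cluster), you can check whether the blocking Pod still has an owner reference pointing at a live controller:

# Print the kind and name of whatever still owns the Pod, if anything does
kubectl get pod ubuntu-deployment-69646b5cdc-2kcx5 -n test \
  -o jsonpath='{.metadata.ownerReferences[*].kind}/{.metadata.ownerReferences[*].name}{"\n"}'

# Confirm the owning ReplicaSet/Deployment still exists
kubectl get replicaset,deployment -n test

An empty ownerReferences result, or an owner that no longer exists, matches what the autoscaler is reporting.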

Scale-down candidates also need to remain removable for a certain period of time before they are actually removed.

From the documentation, I found the following document [1]. Its description is admittedly brief, but it does give a mitigation for your specific issue: review the logs to determine what actions were taken that left a Pod running after its controller was removed. To resolve it, you can manually delete the Pod.
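
Assuming the Pod from your NoScaleDown event is still present, the manual deletion would look like this:

kubectl delete pod ubuntu-deployment-69646b5cdc-2kcx5 -n test

Once no controller-less pods remain on the node, the autoscaler should reconsider it for scale-down.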

Now, as an extra recommendation: scale-down improves resource utilization by evicting pods from underutilized nodes. If you have workloads that you want to safeguard from this, you can define PodDisruptionBudgets (PDBs) for those pods. An alternative is to use the "balanced" (default) autoscaling profile, which is less aggressive during scale-down, or to disable autoscaling altogether.
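
As a minimal sketch (the PDB name and minAvailable value are illustrative, and the cluster name is inferred from the node names in your log; you may also need to pass the cluster's location to gcloud):

# Protect the app's pods from voluntary disruptions such as scale-down evictions
kubectl apply -n test -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ubuntu-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ubuntu
EOF

# Switch the cluster to the less aggressive default profile
gcloud container clusters update cluster-test --autoscaling-profile balanced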

I hope you will find this information helpful 🙂

Thanks & Regards,
Manish Bavireddy.

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility#:~:text=%22no.s... 

 

Are you sure the pod was one of the replicas from your deployment and not another pod?

Yes, the error log shows the same pod name as the one I deployed, and I didn't deploy any other pods on the cluster.
