
Reporting outage - GKE Autopilot scheduling with NVIDIA GPUs broken

We are seeing issues with GKE Autopilot scheduling new nodes with GPU instances. Pods currently cannot be scheduled on new nodes because the automatic NVIDIA driver installation managed by GKE Autopilot is failing.

This is blocking our ability to schedule any new NVIDIA GPU workloads on GKE Autopilot.

We are seeing that the nvidia-gpu-device-plugin-small pod managed in the kube-system namespace is broken, because its cos-nvidia-installer:fixed container is unable to run.
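For reference, this is roughly how I have been locating and inspecting the failing pods (the pod and container names below are examples and may differ on your cluster):

```
# Locate the GPU device-plugin pods in kube-system and see which nodes they landed on
kubectl get pods -n kube-system -o wide | grep nvidia-gpu-device-plugin

# Logs from the driver-installer container of one failing pod
# (pod name is an example; the container name comes from the pod spec and may vary)
kubectl logs -n kube-system nvidia-gpu-device-plugin-small-xxxxx -c nvidia-driver-installer
```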

Looking at this pod more closely, it appears to be failing due to:

curl --retry 5 -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels

This command appears to be run to fetch the node labels that determine which NVIDIA driver version should then be downloaded and installed on the node. I have run the same command on another node in our cluster and can see that the endpoint "http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels" is returning a 404.
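In case anyone wants to reproduce the check, this is roughly how I ran it against another node (the node name and image are placeholders; node debug pods may be restricted on Autopilot, in which case running the curl from any pod whose image ships curl works as well):

```
# Start an ephemeral debug pod on another node and query the metadata server from there
# (node name and image are examples)
kubectl debug node/gke-example-node -it --image=curlimages/curl -- \
  curl -sS --retry 5 -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels"
```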

Is this issue blocking all other customers using GKE Autopilot from running new workloads on GPUs?

Thank you,

Alex

 


Is this a new Autopilot cluster?  And which version?

Hi, this is an existing cluster. The issue is occurring when the cluster creates new nodes for scheduling additional pods requiring GPUs.

The cluster version is 1.27.2-gke.1200.

I've tried this on a few different clusters and I seem to be able to deploy GPU workloads on Autopilot clusters with cluster autoscaler / NAP creating new nodes. I tried the same version as you, as well as the latest 1.26.x version.
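For reference, my smoke test looks roughly like this (the image, pod name, and accelerator value are just examples):

```
# Minimal Autopilot GPU test pod: request one T4 via the accelerator node selector
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
```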

Were you able to deploy GPU workloads in the past?

Ah that is strange. Yes, this cluster has one older node that is running a GPU workload successfully.

The issue for me started happening when I scaled up the replicas of an existing workload that used GPUs. So the older nodes (~5d old) are running the workload just fine, but any new nodes being provisioned to host the new replicas are failing to initialize before the workload can be scheduled.

[Screenshot: logs from the nvidia-gpu-device-plugin pods on the new nodes]

These are the logs I'm seeing from the kube-system managed nvidia-gpu-device-plugin pods on the new nodes. I can see that the cos-nvidia-installer:fixed container is running a bash script that should be applying the labels fetched from:

curl --retry 5 -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels

But running this line of the script on another node shows that the URL is returning 404, so no NVIDIA drivers are being installed 😞

A bit lost as to what's going on here - I've tried draining the faulty nodes but they are being recreated with exactly the same issue.
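(For completeness, the drain cycle I tried looks roughly like this; the node name is an example.)

```
# Cordon and drain the faulty node, then delete it so a replacement is provisioned
# (node name is an example)
kubectl drain gke-example-faulty-node --ignore-daemonsets --delete-emptydir-data
kubectl delete node gke-example-faulty-node
```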

A couple more questions just to help debug:
Which GPU are you using? T4 or A100?
Which region is your cluster in?

I'm using T4 GPUs, the cluster is in us-central1.
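(Confirmed from the node labels, roughly like this; the accelerator label only appears on GPU nodes.)

```
# Show the accelerator type and region labels for every node
kubectl get nodes -L cloud.google.com/gke-accelerator -L topology.kubernetes.io/region
```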

 

Thanks for your help, garisingh!

Hmm ... the good news is that's where my clusters are.  The bad news is I'm not seeing these errors.   But will keep digging.

Thanks. In your clusters, is the nvidia-gpu-device-plugin-small pod in the kube-system namespace being initialized correctly?

And if you run the curl command (from within another node in the cluster) from the script in that pod, does it successfully return the labels with the GPU driver version?
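If it is returning 404 for you as well, it might also be worth listing the instance attributes to see whether kube-labels is present at all, something like:

```
# List all instance metadata attribute names; kube-labels should appear here on a healthy node
curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"
```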

I've managed to resolve this issue by destroying and recreating the cluster.

Hello, me too. I created a GKE Standard cluster this morning with the same version 1.27.2-gke.1200 mentioned in this post, but I forgot to add GPUs to my node pools. After deleting the node pools and recreating them with 1x T4 GPU, I ran into the same issue where `nvidia-gpu-device-plugin-small-cos` pods in kube-system are still initializing after 30 minutes, waiting with the following logs:

```
kubectl logs nvidia-gpu-device-plugin-small-cos-lvb9v -n kube-system -c nvidia-driver-installer
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   657  100   657    0     0   317k      0 --:--:-- --:--:-- --:--:--  641k
GPU driver auto installation is disabled.
Waiting for GPU driver libraries to be available.
```
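In case it's useful, this is roughly how I'm checking whether the node ever ends up advertising the GPU (node name is an example):

```
# Look for GPU-related labels, capacity, and allocatable resources on the affected node
kubectl describe node gke-example-gpu-node | grep -iE "nvidia|gpu"
```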

Recreating the cluster now to see if it helps

Recreated a fresh 1.27.2-gke.1200 Standard cluster and I'm still getting this issue. Trying to recreate it as 1.26.5-gke.1400 now.

Disregard my previous messages. For Standard clusters, starting from GKE 1.27, you need to use the new configuration that specifies that GPU drivers should be installed, because they are not installed by default. In previous versions of GKE we had to use the NVIDIA driver DaemonSet instead.

In my case the logs showed the device-plugin pods blocking the GPU, waiting for drivers to be installed, but I did not have anything in place to install them (such as the NVIDIA driver DaemonSet).
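For reference, that pre-1.27 manual approach is applying Google's NVIDIA driver installer DaemonSet for COS nodes, roughly like this (the URL is the one from the GKE docs, but double-check it for your node image and version):

```
# Pre-1.27 style: manually install NVIDIA drivers on COS nodes via the documented DaemonSet
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```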

My specific case was with Terraform, so I added gpu_driver_version under gpu_driver_installation_config under guest_accelerator in my google_container_node_pool resource, and the problem was resolved. Although this post is about Autopilot, I'm writing this in case other GKE Standard users reach it from Google like I did. Note that this configuration is not valid for older GKE versions, only 1.27+.
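For anyone not using Terraform, I believe the gcloud equivalent when creating the node pool looks roughly like this (the cluster, pool, and machine type names are placeholders; double-check the flag syntax against the current docs):

```
# Create a GPU node pool and let GKE install the driver (gpu-driver-version requires GKE 1.27+)
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default
```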
