
Failing to install GPU driver on GKE Autopilot

GKE version: v1.28.5-gke.1217000

Since about 2 hours ago, whenever a new GPU node is added to my Autopilot cluster, the node is marked as Ready, but the nvidia-gpu-device-plugin pod is stuck in Pending. Looking at the log of the `nvidia-driver-installer` init container, I'm seeing an error downloading the GPU driver installer:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  860  100   860    0     0   397k      0 --:--:-- --:--:-- --:--:--  419k
I0312 15:24:46.351065    3547 installer.go:437] Getting the default GPU driver version
I0312 15:24:46.351554    3547 utils.go:88] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/17800.66.54/lakitu/gpu_default_version
I0312 15:24:46.427615    3547 install.go:246] Installing GPU driver version 535.129.03
I0312 15:24:46.427682    3547 cache.go:76] error: failed to read file /root/home/kubernetes/bin/nvidia/.cache: open /root/home/kubernetes/bin/nvidia/.cache: no such file or directory
I0312 15:24:46.427792    3547 utils.go:88] Downloading bucketlist from https://storage.googleapis.com/storage/v1/b/cos-tools/o?prefix=17800.66.54/lakitu/nvidia-drivers-535.129.03.tgz
I0312 15:24:46.449215    3547 installer.go:128] Configuring driver installation directories
I0312 15:24:46.509510    3547 installer.go:689] Downloading GPU driver installer version 535.129.03
I0312 15:24:46.510844    3547 utils.go:88] Downloading GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run
E0312 15:24:46.545091    3547 install.go:457] failed to download file with description "GPU driver installer" from "https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run" and install into "/usr/local/nvidia": failed to download GPU driver installer, status: 403 Forbidden
Waiting for GPU driver libraries to be available.
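
For anyone triaging the same symptom, here is a minimal sketch that pulls the failing URL and HTTP status out of the installer log, so the download can be retried from outside the cluster. The regular expression is based only on the error line shown above. The helper names `find_failed_downloads` and `http_status` are made up for illustration and are not part of any GKE tooling.

```python
import re
import urllib.error
import urllib.request

# Pattern written against the installer error line quoted in this post;
# other installer versions may word the message differently (assumption).
FAILED_DOWNLOAD = re.compile(
    r'failed to download file with description "(?P<what>[^"]+)" '
    r'from "(?P<url>[^"]+)".*status: (?P<status>\d{3})'
)


def find_failed_downloads(log_text):
    """Return (description, url, status) for each failed download in the log."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_DOWNLOAD.search(line)
        if match:
            results.append(
                (match.group("what"), match.group("url"), int(match.group("status")))
            )
    return results


def http_status(url, timeout=10):
    """Send a HEAD request to url and return the HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is still what we want to report.
        return err.code
```

Feeding the init container log to `find_failed_downloads` and then calling `http_status` on each returned URL from a workstation should show whether the 403 is returned by the bucket itself rather than being specific to the node.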

 

 

 

 


 


6 REPLIES

[Attached: Screenshot 2024-03-12 at 11.19.18 AM.png]

Here is our chart of the pending pod gauge.

We are seeing this across our GKE clusters as well. New GPU nodes are not able to complete the NVIDIA driver installation.

We are experiencing the same issue on all our GKE nodes.

Also having this issue in a GKE cluster:


E0312 20:33:50.550203 6980 install.go:452] failed to download GPU driver installer: failed to download file with description "GPU driver installer" from "https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/101/tesla/470_00/470.199...." and install into "/usr/local/nvidia": failed to download GPU driver installer, status: 403 Forbidden

We are experiencing the same issue in all of our clusters.
