GKE version
Since about 2 hours ago, whenever a new GPU node is added to my Autopilot cluster, the node is marked as ready, but the nvidia-gpu-device-plugin pod is stuck in Pending. Looking at the log of the `nvidia-driver-installer` init container, I'm seeing an error downloading the GPU driver installer (403 Forbidden):
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
860 100 860 0 0 397k 0 --:--:-- --:--:-- --:--:-- 419k
I0312 15:24:46.351065 3547 installer.go:437] Getting the default GPU driver version
I0312 15:24:46.351554 3547 utils.go:88] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/17800.66.54/lakitu/gpu_default_version
I0312 15:24:46.427615 3547 install.go:246] Installing GPU driver version 535.129.03
I0312 15:24:46.427682 3547 cache.go:76] error: failed to read file /root/home/kubernetes/bin/nvidia/.cache: open /root/home/kubernetes/bin/nvidia/.cache: no such file or directory
I0312 15:24:46.427792 3547 utils.go:88] Downloading bucketlist from https://storage.googleapis.com/storage/v1/b/cos-tools/o?prefix=17800.66.54/lakitu/nvidia-drivers-535.129.03.tgz
I0312 15:24:46.449215 3547 installer.go:128] Configuring driver installation directories
I0312 15:24:46.509510 3547 installer.go:689] Downloading GPU driver installer version 535.129.03
I0312 15:24:46.510844 3547 utils.go:88] Downloading GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run
E0312 15:24:46.545091 3547 install.go:457] failed to download file with description "GPU driver installer" from "https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/NVIDIA-Linux-x86_64-535.129.03.run" and install into "/usr/local/nvidia": failed to download GPU driver installer, status: 403 Forbidden
Waiting for GPU driver libraries to be available.
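For anyone collecting these logs from several nodes, here is a minimal sketch for pulling the failing URL and HTTP status out of the installer output. It assumes the error line keeps the format shown in the log above; the regex and the helper name `find_failed_downloads` are my own and are not part of the installer.

```python
import re

# Pattern for the installer's download-failure line, based only on the log
# format pasted above (an assumption; other installer versions may differ).
FAILED_DOWNLOAD = re.compile(
    r'failed to download file with description "(?P<what>[^"]+)" '
    r'from "(?P<url>[^"]+)".*status: (?P<status>\d{3})'
)


def find_failed_downloads(log_text):
    """Return (description, url, http_status) for each failed download line."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_DOWNLOAD.search(line)
        if match:
            results.append(
                (match.group("what"), match.group("url"), int(match.group("status")))
            )
    return results


# Sample line copied from the log above.
sample = (
    'E0312 15:24:46.545091 3547 install.go:457] failed to download file with '
    'description "GPU driver installer" from '
    '"https://storage.googleapis.com/nvidia-drivers-us-public/tesla/535.129.03/'
    'NVIDIA-Linux-x86_64-535.129.03.run" and install into "/usr/local/nvidia": '
    'failed to download GPU driver installer, status: 403 Forbidden'
)

for what, url, status in find_failed_downloads(sample):
    print(f"{status} {what}: {url}")
```

Lines that do not match the pattern (the `I...` info lines, the curl progress output) are simply skipped, so the whole init container log can be passed in as one string.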
Here is our chart for the pending pod gauge.
We are seeing this across our GKE clusters as well. New GPU nodes are not able to complete Nvidia driver installation.
We are experiencing the same issue on all our GKE nodes.
We are also having this issue in our GKE cluster:
E0312 20:33:50.550203 6980 install.go:452] failed to download GPU driver installer: failed to download file with description "GPU driver installer" from "https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/101/tesla/470_00/470.199...." and install into "/usr/local/nvidia": failed to download GPU driver installer, status: 403 Forbidden
We are experiencing the same issue in all of our clusters.