A100 GPU VM on GCP: “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver.”

nvidia-smi Error: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver
There seems to be an NVIDIA driver issue on the A100 40GB VM instances I spin up in GCP Compute Engine from a boot disk storage container: running `nvidia-smi` after SSHing into a new instance returns:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
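(Two quick checks that confirm the symptom: the first shows whether the NVIDIA kernel module is loaded at all, the second whether the GPU is even visible on the PCI bus.)

# No output here means the kernel module is not loaded.
lsmod | grep nvidia

# The A100 should still show up on the PCI bus even without a driver.
lspci | grep -i nvidia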

Manually installed CUDA Driver
Therefore, I manually installed the CUDA driver by downloading the .run file listed at https://www.nvidia.com/download/driverResults.aspx/191320/en-us/ :

wget https://us.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
sudo sh NVIDIA-Linux-x86_64-515.65.01.run
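(Side note: if this ever needs to run unattended, e.g. from a startup script, the .run installer has a non-interactive mode; I haven't used it here, and the flags are worth double-checking against `--help`:)

# --silent skips the interactive prompts; --dkms (needs the dkms package)
# rebuilds the kernel module automatically after kernel upgrades.
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --silent --dkms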

Then I verified that the driver is installed:

$ nvidia-smi
Fri Oct 21 10:03:18 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |      0MiB / 40960MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

NVIDIA Driver Error: Found no NVIDIA driver on your system
However, my Python script that loads models onto CUDA still errored out with `RuntimeError: Found no NVIDIA driver on your system`.

For context, this boot disk storage container (including the Python script) runs successfully on P100, T4, and V100 GPUs on GCP. Please see the stack trace below:

  File "service.py", line 223, in download_models
    config['transformer']['model'][model_name] = model_name_function_mapping[model_name](model).eval().cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 128, in cuda
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 215, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
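
A quick way to reproduce the failure outside of the service, using the same interpreter as in the stack trace above (on the broken instance this presumably prints `False 0`; on a healthy one it should print `True 1`):

/opt/conda/bin/python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"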

Question
Can you please advise how to solve this NVIDIA driver issue? Our deployment spins up a GPU VM on-demand as inference requests arrive, so ideally the A100 VM on GCP would come with the NVIDIA driver pre-installed to avoid installation latency. Thank you!
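
(One option I'm considering, though I haven't verified it on A2 machine types yet, is creating the instance from a Deep Learning VM image with the `install-nvidia-driver=True` metadata key so the driver is installed automatically on first boot. The instance name, zone, image family, and disk size below are placeholders:)

gcloud compute instances create my-a100-vm \
    --zone=us-central1-a \
    --machine-type=a2-highgpu-1g \
    --maintenance-policy=TERMINATE \
    --image-family=common-cu113 \
    --image-project=deeplearning-platform-release \
    --boot-disk-size=200GB \
    --metadata="install-nvidia-driver=True"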


Due to the nature of the issue you're experiencing, it would be impossible to reproduce the issue without inspecting your project. Please follow one of these two options in order for GCP Support to assist you:

  1. Create a Support Case: https://cloud.google.com/support-hub
  2. Create a Public Issue Tracker report: https://b.corp.google.com/issues/new?component=187134&template=1162898

I have the exact same problem. The GPU worked fine a few days ago. I started the instance again today and `nvidia-smi` displays that same error. It's like the driver disappeared. Did you have any luck figuring out what happened?

I haven't figured it out yet, unfortunately. Were you starting up an A100 GPU VM instance with a custom VM storage (OS) image with the NVIDIA driver / CUDA toolkit pre-installed, which worked before but now fails? I'd appreciate it if you could keep me posted if you figure it out!

Yes, that's exactly what happened. I used the c0-deeplearning-common-cu110-v20220806-debian-10 image with the driver preinstalled, and the same GPU as well.

I did resolve it, by reinstalling the driver manually. Here are the steps (a consolidated shell sketch follows the list):

  1. Create a snapshot, just in case.
  2. Purge the existing driver packages:
    sudo apt-get purge nvidia-*
    sudo apt-get update
    sudo apt-get autoremove

  3. Reboot.
  4. For some reason, `nvidia-smi` was still present on the system, which is a problem because that's how the install script checks whether a driver is already installed. I got rid of it with `sudo mv /usr/bin/nvidia-smi /usr/bin/nvidia-smi.backup`.
  5. Run the install script from https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#installation_scripts
Hope that helps!

Upon a more careful re-read, I see that you're spinning up new instances rather than starting an existing one, and that you have already reinstalled the driver manually. For what it's worth, my Python code is able to utilise the GPU via PyTorch. Apologies for the misinformed answer!

Thank you for your reply!

Hi, I ran into the same issue. The solution I found was to re-run the following script to reinstall the NVIDIA driver on the server after a restart:

`/opt/deeplearning/install-driver.sh`
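
If the driver has to be reinstalled after every restart anyway, one way to automate that (assuming a Deep Learning VM image where this script exists; the instance name and zone below are placeholders) is to attach it as a startup script so it runs on each boot:

gcloud compute instances add-metadata my-a100-vm \
    --zone=us-central1-a \
    --metadata=startup-script='#! /bin/bash
/opt/deeplearning/install-driver.sh'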