nvidia-smi Error: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver
There seems to be an NVIDIA driver issue with the A100 40GB VM instances I spin up on GCP Compute Engine from a boot disk storage container: running `nvidia-smi` after SSHing into a new instance returns:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Manually installed CUDA Driver
As a workaround, I manually installed the NVIDIA driver using the `.run` installer found via https://www.nvidia.com/download/driverResults.aspx/191320/en-us/:

wget https://us.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
sudo sh NVIDIA-Linux-x86_64-515.65.01.run
Then I verified the driver was installed by running `nvidia-smi`:
$ nvidia-smi
Fri Oct 21 10:03:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |      0MiB / 40960MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
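If you want to verify the driver programmatically (for example in a health check before the service starts loading models), a minimal sketch of parsing the driver version out of `nvidia-smi` output; the `parse_driver_version` helper is hypothetical, not part of any NVIDIA tooling:

```python
import re


def parse_driver_version(smi_output):
    """Return the driver version string from `nvidia-smi` output, or None.

    Looks for the "Driver Version: X.Y.Z" field in the header banner;
    the error message ("NVIDIA-SMI has failed ...") has no such field,
    so the function returns None in the broken-driver case.
    """
    match = re.search(r"Driver Version:\s*([\d.]+)", smi_output)
    return match.group(1) if match else None


# Example against the banner line shown above:
banner = "| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |"
print(parse_driver_version(banner))  # -> 515.65.01
```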
NVIDIA Driver Error: Found no NVIDIA driver on your system
However, my Python script that loads models to CUDA still errored out with RuntimeError: Found no NVIDIA driver on your system.
For context, this boot disk storage container (including the Python script) runs successfully on P100, T4, and V100 GPUs on GCP. Please see the stack trace below:
File "service.py", line 223, in download_models
    config['transformer']['model'][model_name] = model_name_function_mapping[model_name](model).eval().cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 128, in cuda
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 215, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
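Until the driver issue is resolved, one way to fail more gracefully in the service is to select the device explicitly instead of calling `.cuda()` unconditionally. A minimal sketch; the `cuda_available` callable stands in for `torch.cuda.is_available`, and falling back to CPU is an assumption about what the service should do, not part of the original code:

```python
def pick_device(cuda_available, prefer="cuda"):
    """Return "cuda" when preferred and available, else "cpu".

    `cuda_available` is a zero-argument callable such as
    `torch.cuda.is_available`, injected so the logic is testable
    without a GPU or driver present.
    """
    if prefer == "cuda" and cuda_available():
        return "cuda"
    return "cpu"


# Hypothetical use in the service (with PyTorch installed):
#   device = pick_device(torch.cuda.is_available)
#   model = model_name_function_mapping[model_name](model).eval().to(device)
print(pick_device(lambda: False))  # -> cpu  (driver missing)
print(pick_device(lambda: True))   # -> cuda
```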
Question
Can you please advise how to solve this NVIDIA driver issue? Our deployment spins up a GPU VM on demand as inference requests arrive, so ideally the A100 VM on GCP would already have the NVIDIA driver pre-installed to avoid added latency. Thank you!
Due to the nature of the issue you're experiencing, it would be impossible to reproduce it without inspecting your project. Please follow one of these two options so that GCP Support can assist you:

Create a Support Case: https://cloud.google.com/support-hub
Create a Public Issue Tracker: https://b.corp.google.com/issues/new?component=187134&template=1162898
I have the exact same problem. The GPU worked fine a few days ago. I started the instance again today and `nvidia-smi` displays that same error. It's like the driver disappeared. Did you have any luck figuring out what happened?
I haven't figured it out yet, unfortunately. Were you starting up an A100 GPU VM instance with a custom VM storage (OS) image that had the NVIDIA driver / CUDA toolkit pre-installed, which worked before but now fails? I'd appreciate it if you could keep me posted if you figure it out!
Yes, that's exactly what happened, I used the c0-deeplearning-common-cu110-v20220806-debian-10 image with the driver preinstalled, same GPU as well.
I did resolve it, by reinstalling the driver manually. Here are the steps.

First, purge the existing driver with:
sudo apt-get purge nvidia-*
sudo apt-get update
sudo apt-get autoremove
Hope that helps!
Upon a more careful re-read, I see that you're spinning up new instances rather than starting an existing one, and that you already reinstalled the driver manually. For what it's worth, my Python code is able to utilise the GPU via PyTorch. Apologies for the misinformed answer!
Thank you for your reply!
Hi, I ran into the same issue. The solution in my case was to run the following script on each restart to reinstall the NVIDIA drivers on the server:
`/opt/deeplearning/install-driver.sh`
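On the original question of avoiding manual installs on freshly spun-up instances: with the GCP Deep Learning VM images, the driver install can also be requested at instance-creation time via the `install-nvidia-driver` metadata key, so the image's install script runs automatically on first boot. A sketch of the instance-creation command; the instance name, zone, machine type, and image family below are placeholders to adapt to your setup:

```shell
# Hypothetical name/zone/machine type; adjust to your deployment.
gcloud compute instances create my-a100-instance \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1 \
  --image-family=common-cu110 \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"
```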