nvidia-smi Error: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver
There seems to be an NVIDIA driver issue on the A100 40GB VM instances I spin up in GCP Compute Engine with a boot disk storage container: running `nvidia-smi` after SSHing into a fresh instance returns:
```
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```
Manually installed CUDA Driver
Therefore, I manually installed a CUDA driver. I found the `.run` installer via https://www.nvidia.com/download/driverResults.aspx/191320/en-us/ and ran:

```
wget https://us.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
sudo sh NVIDIA-Linux-x86_64-515.65.01.run
```
Then I verified the driver was installed by running `nvidia-smi`:
```
$ nvidia-smi
Fri Oct 21 10:03:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |      0MiB / 40960MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
NVIDIA Driver Error: Found no NVIDIA driver on your system
However, my Python script that loads models onto the GPU still errors out with `RuntimeError: Found no NVIDIA driver on your system`.
For context, this boot disk storage container (including the Python script) runs successfully on GCP P100, T4, and V100 GPU instances. Please see the stack trace below:
```
File "service.py", line 223, in download_models
  config['transformer']['model'][model_name] = model_name_function_mapping[model_name](model).eval().cuda()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 128, in cuda
  return super().cuda(device=device)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
  return self._apply(lambda t: t.cuda(device))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
  module._apply(fn)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
  module._apply(fn)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
  module._apply(fn)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
  param_applied = fn(param)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
  return self._apply(lambda t: t.cuda(device))
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 215, in _lazy_init
  torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
```
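To narrow down whether the problem is the driver itself or what the PyTorch build can see of it, I've been using a small diagnostic helper (the function name is mine; it only assumes PyTorch may or may not be importable):

```python
import importlib.util

def torch_cuda_report():
    """Collect what the installed PyTorch build (if any) can see of CUDA."""
    # Guard the import so this check also runs on machines without torch.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch
    return {
        "torch_installed": True,
        # CUDA version the torch wheel was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # Whether torch can actually initialize the driver at runtime.
        "cuda_available": torch.cuda.is_available(),
    }

print(torch_cuda_report())
```

If `built_with_cuda` is `None`, the container ships a CPU-only PyTorch wheel; if it is set but `cuda_available` is `False`, PyTorch cannot reach the driver that `nvidia-smi` is reporting.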
Question
Can you please advise how to solve this NVIDIA driver issue? Our deployment spins up a GPU VM on demand as inference requests arrive, so ideally the A100 VM on GCP would already have an NVIDIA driver pre-installed to avoid startup latency. Thank you!
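For context on the latency constraint: as a stopgap, the service could poll for the driver at instance startup before accepting traffic. A rough sketch of that idea (function name, timeouts, and the specific checks are mine):

```python
import os
import shutil
import time

def wait_for_nvidia_driver(timeout_s=120.0, poll_s=2.0):
    """Block until the NVIDIA kernel driver looks usable, or the timeout expires.

    Checks for /proc/driver/nvidia/version (created when the kernel module
    loads) and for nvidia-smi on PATH; returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists("/proc/driver/nvidia/version") and shutil.which("nvidia-smi"):
            return True
        time.sleep(poll_s)
    return False
```

This only masks the driver-install cost at boot, though; it doesn't remove it, which is why a pre-installed driver image would be preferable.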