A100 GPU VM on GCP: “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver.”

nvidia-smi Error: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver
There seems to be an NVIDIA driver issue with the A100 40GB VM instances that I spin up in GCP Compute Engine with a boot disk storage container, since running `nvidia-smi` after SSHing into a new instance returns:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
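
As a sanity check, the standard (non-GCP-specific) commands below are what I'd expect to confirm that the kernel module simply isn't there on a fresh instance:

$ lsmod | grep nvidia
(no output - the nvidia kernel module is not loaded)
$ cat /proc/driver/nvidia/version
cat: /proc/driver/nvidia/version: No such file or directory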

Manually installed CUDA Driver
Therefore, I manually installed a CUDA driver by downloading the .run installer from https://www.nvidia.com/download/driverResults.aspx/191320/en-us/ :

wget https://us.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
sudo sh NVIDIA-Linux-x86_64-515.65.01.run
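
(Side note: since these VMs are created on demand, I would eventually want this install to be scriptable. As far as I understand, the .run installer supports a silent mode, roughly as below; the headers line assumes a Debian/Ubuntu-based image.)

sudo apt-get install -y linux-headers-$(uname -r)
sudo sh NVIDIA-Linux-x86_64-515.65.01.run --silent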

Then I verified that the CUDA driver is installed:

$ nvidia-smi
Fri Oct 21 10:03:18 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |      0MiB / 40960MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

NVIDIA Driver Error: Found no NVIDIA driver on your system
However, my Python script that loads models to CUDA still errored out with RuntimeError: Found no NVIDIA driver on your system.

For context, this boot disk storage container (including the Python script) runs successfully on P100, T4, and V100 GPUs on GCP. Please see the stack trace below:

  File "service.py", line 223, in download_models
    config['transformer']['model'][model_name] = model_name_function_mapping[model_name](model).eval().cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 128, in cuda
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 215, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
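
For reference, the same failure can be reproduced outside the service with a one-liner in the same environment that service.py uses (I'd expect is_available() to come back False there, matching the error above):

$ python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"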

Question
Can you please advise how to solve this NVIDIA driver issue? Our deployment spins up a GPU VM on demand as inference requests arrive, so ideally the A100 VM on GCP would already have an NVIDIA driver pre-installed to avoid startup latency.
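
For concreteness, what I'm hoping for is a create command whose resulting VM already has a working driver, along these lines (the instance name and zone are placeholders, and the image family / install-nvidia-driver metadata key are my best guess from the Deep Learning VM image docs, so please correct me if that's the wrong approach):

gcloud compute instances create a100-inference-vm \
    --zone=us-central1-a \
    --machine-type=a2-highgpu-1g \
    --maintenance-policy=TERMINATE \
    --image-family=common-cu113 \
    --image-project=deeplearning-platform-release \
    --metadata=install-nvidia-driver=True

Thank you!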
