Compute instance GPU uninstalled on restart

bawolf · 10-11-2022 11:24 AM

Hello!

My goal is to host a version of Stable Diffusion on Compute Engine that I can use as a microservice for building apps. Since I am a hobbyist, my plan is to power the instance on only while I'm actually using it, to save a bit of money. I created this write-up, which describes my steps.

The first time I spin up the instance, everything works as expected: I opt into installing the GPU driver, follow the steps I outlined in the post, and can make requests to the compute instance.

However, when I restart the instance after shutting it down, the Docker image is unable to run because it doesn't have a driver for the GPU. I tried following the instructions from Google for installing GPU drivers, and sure enough, `nvidia-smi` isn't installed:

bryantwolf@stable-diffusion-api:~$ sudo nvidia-smi
sudo: nvidia-smi: command not found

So I follow those instructions and end up with:

bryantwolf@stable-diffusion-api:~$ sudo nvidia-smi
Tue Oct 11 17:59:54 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    29W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But when I try to run a Docker image that requires a GPU, I end up with this error message:

Running 'script/download-weights <my huggingface api key>' in Docker with the current directory mounted as a volume...
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ⅹ Docker is missing required device driver

Is there something about the restart process I don't understand that might help me mitigate this problem?

Andre_Fiesco

Hello, bawolf,

After checking several similar cases, I found some that may help with the driver issue. The problem seems to be with the driver installation, and these commands have been reported to help:

This is the correct way to install the NVIDIA driver on a GCP instance:

cd / 
sudo apt purge 'nvidia-*'

Reboot.

cd / 
sudo wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run 
sudo sh cuda_11.2.2_460.32.03_linux.run

Adjust the configuration as the installer prompts for options in the terminal.

Reboot.
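
After the final reboot, it may be worth confirming that the driver is actually visible on the host before retrying the container. A minimal sketch: the function name `driver_status` is only illustrative, and the script assumes nothing beyond `nvidia-smi` being on the PATH once the driver is installed.

```shell
#!/bin/sh
# Report whether a driver tool is on the PATH.
# driver_status is an illustrative helper name, not part of any NVIDIA tool.
driver_status() {
    # Optional argument: binary to look for (defaults to nvidia-smi).
    tool="${1:-nvidia-smi}"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "present"
    else
        echo "missing"
    fi
}

if [ "$(driver_status)" = "present" ]; then
    # Print the GPU name and driver version, one GPU per line.
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
    echo "nvidia-smi not found: the driver needs to be (re)installed"
fi
```

Running this right after each start of the instance shows whether the driver survived the restart, which is the part of the problem described in the question.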

Another way of installing the drivers:

Run manually: sudo dpkg --configure -a
Disconnect from the machine.
Connect again using SSH. Select Y again when asked to install the NVIDIA driver.

More information regarding these commands can be found in this Stack Overflow case.
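
Since the error in the question comes from Docker even though `nvidia-smi` works on the host, it may also help to check the container side separately. The sketch below only reports which tools are on the PATH. It rests on an assumption not confirmed in this thread: that Docker's GPU support depends on the NVIDIA Container Toolkit, and that the toolkit installs a binary named `nvidia-container-runtime-hook`. Please verify both against NVIDIA's documentation. The helper name `report_tools` is hypothetical.

```shell
#!/bin/sh
# Print "found" or "missing" for each tool name given as an argument.
# report_tools is an illustrative helper, not part of Docker or NVIDIA tooling.
report_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# Assumption: nvidia-container-runtime-hook is provided by the NVIDIA
# Container Toolkit and is what Docker looks for when a GPU is requested.
report_tools docker nvidia-smi nvidia-container-runtime-hook
```

If `nvidia-smi` shows as found while the hook shows as missing, that would suggest the host driver is fine and the Docker integration is what needs reinstalling after the restart.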

I found these other cases with similar problems that can be useful: