
Unable to install drivers on Compute Engine VM Instance

I spun up a new Compute Engine VM Instance (n1-standard-4 machine with 2 T4 GPUs) and tried to install the NVIDIA drivers to use the GPUs on the new instance, following the instructions here: https://cloud.google.com/compute/docs/gpus/install-drivers-gpu

Configuration: Linux Debian x86_64 version 12
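For context, the instance was created with a gcloud command along these lines (a rough sketch; the instance name, zone, and boot disk size below are placeholders rather than my exact values):

gcloud compute instances create gpu-test-vm \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=2 \
    --maintenance-policy=TERMINATE \
    --image-family=debian-12 \
    --image-project=debian-cloud \
    --boot-disk-size=100GB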

When I tried running the utility script provided by GCP (install_gpu_driver.py), I received the following error:

ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_lock' 
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_unlock'

I received the same error when installing the NVIDIA drivers via the CUDA Toolkit:

wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_lin...
sudo sh cuda_12.3.2_545.23.08_linux.run

From https://forums.developer.nvidia.com/t/linux-6-7-3-545-29-06-550-40-07-error-modpost-gpl-incompatible... it seems there's an issue with the latest NVIDIA drivers on this kernel version that will be patched in a future release. Until then, how can I get NVIDIA drivers installed on a VM Instance to unblock me from training with GPUs on Google Cloud Platform? Can I configure a new VM Instance on an older kernel version?


Hi @mtam2013,

Welcome to the Google Cloud Community!

If I understand your question correctly, the same document that you linked covers this: you can use Deep Learning VM images, which come with NVIDIA drivers pre-installed [1]:

Alternatively, you can skip this setup by creating VMs with Deep Learning VM images. Deep Learning VM images have NVIDIA drivers pre-installed, and also include other machine learning applications such as TensorFlow and PyTorch.
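For example, a Deep Learning VM with T4 GPUs can be created with a command along these lines (a sketch; the image family shown is only an example, so please check the Deep Learning VM images documentation for the currently supported families):

gcloud compute instances create dlvm-gpu-test \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=2 \
    --maintenance-policy=TERMINATE \
    --image-family=common-cu121 \
    --image-project=deeplearning-platform-release \
    --boot-disk-size=200GB \
    --metadata="install-nvidia-driver=True"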

You can view these documents for more information:

I hope this helps. Thank you.

[1]. https://cloud.google.com/compute/docs/gpus/install-drivers-gpu

You could downgrade the kernel and then set the VM to boot into the older kernel version. I got this working with the steps below:

 

# Check the current kernel version. If it is the 6.1.0-18 kernel shown below,
# install the older 6.1.0-17 image and matching headers.
uname -r
# 6.1.0-18-cloud-amd64

# List the kernel images and headers available in the Debian repos
apt-cache search linux | grep -P "^linux-image[^ ]*[0-9]\-amd64 "
apt-cache search linux | grep -P "^linux-image[^ ]*[0-9]\-cloud-amd64 "
sudo apt install linux-image-6.1.0-17-cloud-amd64
apt-cache search linux | grep -P "^linux-headers[^ ]*[0-9]\-cloud-amd64 "
sudo apt install linux-headers-6.1.0-17-cloud-amd64

# Set the GRUB loader to boot the older kernel on the next boot only.
# '1>2' is a menu path under "Advanced options"; adjust it to match the
# position of the 6.1.0-17 entry in your GRUB menu.
sudo grub-reboot '1>2'
sudo reboot

# After the reboot, confirm the older kernel is running and install the driver
uname -r
sudo python3 install_gpu_driver.py

# To boot the older kernel as the default every time, set GRUB_DEFAULT="1>2"
# in /etc/default/grub and regenerate the GRUB configuration
sudo vim /etc/default/grub
sudo update-grub

# Verify the driver is loaded
nvidia-smi

I had the same error. You can change the driver version per the compatibility guidelines in https://cloud.google.com/compute/docs/gpus/install-drivers-gpu, looking up the available driver versions via gsutil ls gs://nvidia-drivers-us-public/tesla/.
Once you've found the driver you want to switch to, edit the driver version file /opt/deeplearning/driver-version.sh (you may need sudo: sudo vi /opt/deeplearning/driver-version.sh) and rerun the installer with sudo /opt/deeplearning/install-driver.sh. 535.183.01 worked for me.
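Putting that together, the sequence looks roughly like this (a sketch of the steps above; substitute whichever driver version matches the compatibility table for your setup):

# List the driver versions available in the public bucket
gsutil ls gs://nvidia-drivers-us-public/tesla/

# Edit the pinned driver version used by the Deep Learning image installer
sudo vi /opt/deeplearning/driver-version.sh

# Rerun the installer and verify the driver loads
sudo /opt/deeplearning/install-driver.sh
nvidia-smi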

Thanks, 535.183.01 worked for me as well on an N1 machine with 4x P100 GPUs.