GPUs not available inside containers

vedantroy-genmo · 07-29-2024 01:23 AM

I'm using GCP batch with GPUs, but I'm getting errors like the following:

```

assert torch.cuda.is_available(), "GPU not available"

```

or:

```

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

```

I am running my jobs inside of containers. And, I'm installing the versions of Pytorch, etc. that have CUDA drivers.

Here is my job config: https://paste.gg/p/anonymous/1678edc73dde45459ade23caa76ec260. The job ID is: process-data-id28-3785087a-13a1-46ba00.

vedantroy-genmo

Using `--gpus=all`, or `--runtime=nvidia` does not help either.

alexmoore

Hi,

Are you installing the drivers as per:

https://cloud.google.com/batch/docs/create-run-job-gpus#requirements-job-use-gpu

This will get you GPU capability on the instance the batch job runs on. Then to get the GPU into the container you'll likely need to pass parameters to the taskSpec for the container. If you look here:

https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#Container

You will see an 'options' variable where you can specify other options to be passed to docker and then if you look on this page, you will see options to configure the container to be able to see the GPU:

https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#configure_containers_to_consume...

There is not an end to end guide for what you are trying to achieve that I can see, but hopefully these pieces allow you to make some progress.

All the best,

Alex