
GCP Batch using NVIDIA GPUs to train models: what installation is required?

Hi, 

I am running my model training process inside a Docker container using the GCP Batch service. I am using the batch-cos machine image and set

installGpuDrivers=True

I found online that in order for PyTorch to run with CUDA support inside a container, I need the NVIDIA Container Toolkit. However, https://cloud.google.com/batch/docs/create-run-job-gpus says: "If your job has any container runnables and does not use Container-Optimized OS, you must also install the NVIDIA Container Toolkit." So if I am using the batch-cos image, do I not really need to worry about anything? The driver will be installed automatically for me, the Container Toolkit comes with batch-cos, and the CUDA toolkit ships with PyTorch?
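For reference, the relevant part of my job config looks roughly like this (the machine type and GPU type below are just placeholders, not my exact values):

"allocationPolicy": {
    "instances": [
        {
            "installGpuDrivers": true,
            "policy": {
                "machineType": "n1-standard-4",
                "accelerators": [
                    {
                        "type": "nvidia-tesla-t4",
                        "count": "1"
                    }
                ],
                "bootDisk": {
                    "image": "batch-cos"
                }
            }
        }
    ]
}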

28 REPLIES

Hi @gradientopt,

Correct, if you are running a Batch container-only job with GPUs, Batch will auto-select the Batch Container-Optimized OS image for you, and you don't need to manually install the NVIDIA Container Toolkit for Batch Container-Optimized OS images.

The nvidia-container-toolkit is mainly needed for other OS types such as Debian, where Batch uses the `--gpus all` option, while Container-Optimized OS relies on `--privileged`. Batch adds those options automatically for your GPU jobs. And even if you are using non-Container-Optimized OS images, Batch will auto-install the `nvidia-container-toolkit` for you as long as your network allows it.

Thanks,

Wenyan

 

Thanks! But could you explain what you mean by '--gpus all' vs '--privileged'?

I see what you mean; these are the options Batch uses to run Docker jobs. But why would Container-Optimized OS use '--privileged' while Debian uses '--gpus all'?

I searched online and it seems that even if you use the --privileged option when launching Docker, you still need the nvidia-container-toolkit?

Those are the technical details of how we support Batch GPU container jobs. You should be fine without running any specific commands.

A separate question I have is: if Batch automatically fetches a GPU driver for me, how can I ensure that it will be compatible with my CUDA version (I use CUDA 12.1)? It seems that I only have control over the GPU type, but not over which GPU driver is fetched.

We support GPU driver version specification: https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#accelerator.

For Container-Optimized OS images it is supported, and you can try to find a compatible driver version, but I would suggest using the default version if feasible, because each Container-Optimized OS image supports only a limited set of GPU driver versions, and the Batch Container-Optimized image is built on top of the Container-Optimized OS images.
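For example, here is a sketch of the accelerator spec inside allocationPolicy.instances[].policy with an explicit driver version (the version string below is only illustrative; please check which driver versions your COS milestone actually supports):

"accelerators": [
    {
        "type": "nvidia-tesla-t4",
        "count": "1",
        "driverVersion": "535.183.01"
    }
]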

Thanks for the reply! If I am understanding correctly, if I ask Batch to install GPU drivers for me, it will just fetch the default driver from the latest COS release milestone? As of today, milestone 113 and v535.183.01?

Mostly yes.

If you run a container-only job with GPUs, and Batch by default selects the Batch Container-Optimized OS image, it will be the latest GPU driver version, which is `v550.90.07` with CUDA version 12.4.

If you use non-COS images such as the Batch Debian image, the version Batch selects will be `v550.54.15` with CUDA version 12.4.

Thanks for the reply! So Batch would install CUDA on Container-Optimized OS as well? I am wondering why that is necessary? The CUDA that gets used will be the one installed inside the Docker image, right?

Here the CUDA is the one installed by the GPU driver installation script; no special installation is needed.

I hope it's not too late to hop on this thread.

The behaviour I'm seeing is that when I try to run an image that (1) isn't based on one of the nvidia/cuda images and (2) doesn't have a boot disk specified, Batch defaults to COS and subsequently fails to detect GPUs.

The same container runs fine if I run it on a GCE VM with the COS boot image, with `docker run --runtime nvidia --gpus all`.

To run a GPU-enabled container successfully on Google Batch, I need to either specify the boot disk image as `batch-debian` or use a Docker image that is based on one of the nvidia/cuda images.

It seems to me that this behaviour isn't aligned with what was explained above:

if you are running a Batch container-only job with GPUs, Batch will auto-select the Batch Container-Optimized OS image for you, and you don't need to manually install the NVIDIA Container Toolkit for Batch Container-Optimized OS images.

The nvidia-container-toolkit is mainly needed for other OS types such as Debian, where Batch uses the `--gpus all` option, while Container-Optimized OS relies on `--privileged`. Batch adds those options automatically for your GPU jobs. And even if you are using non-Container-Optimized OS images, Batch will auto-install the `nvidia-container-toolkit` for you as long as your network allows it.


In other words, `--privileged` does not seem to be enough for containers to access the GPUs.

Am I missing something, or is this behaving as intended?

Hi @AvishaiW,

We would expect that if you follow https://cloud.google.com/batch/docs/create-run-job-gpus#create-job-gpu-examples, it should work for both COS-based and non-COS images.

The logs (https://cloud.google.com/batch/docs/analyze-job-using-logs#view-job-logs) should tell you whether the NVIDIA driver was installed successfully on the host machine. If they show that the driver was installed successfully, then usually you should be able to consume the GPU driver in your container. If that does not work for your image, I have several questions.

(1) What kind of image are you using besides the Docker image? E.g. if you use image streaming, you need to enable it in your job request: https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#container. If this happens to be a container image case we haven't covered, we can also help check on our side.

(2) Would you mind trying to add more `--volume` options, similar to https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#configure_containers_to_consume..., to make sure all the required directories and devices are mounted?

(3) Would you mind sharing more error logs if the above does not help?

Thanks!


@wenyhu wrote:

The logs (https://cloud.google.com/batch/docs/analyze-job-using-logs#view-job-logs) should tell you whether the NVIDIA driver was installed successfully on the host machine.


I see this in the logs but I am unable to consume the GPU driver:

> GPU drivers successfully installed.
> Making the GPU driver installation path executable by re-mounting it.


(1) Only the docker image.

(2) This didn't make a difference

(3) I'm building an image FROM python:3-slim-bullseye and pip installing nvidia-ml-py. Then in the Python file I try to `pynvml.nvmlInit()` and it raises an exception. As mentioned before, this works fine when I run it on a GCE VM with the COS boot image, with `docker run --runtime nvidia --gpus all`.

I do see these warnings in the logs, but I don't know if they're relevant:


> E0929 00:02:18.755704 1792 utils.go:355] WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
> E0929 00:02:19.015138 1792 utils.go:355] WARNING: nvidia-installer was forced to guess the X library path '/usr/local/nvidia/lib64' and X module path '/usr/local/nvidia/lib64/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
> E0929 00:02:19.015169 1792 utils.go:355] WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.

Hi @AvishaiW,

If `docker run --runtime nvidia --gpus all` works for you when you run it manually, could you try adding `--runtime nvidia --gpus all` in the options field (https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#container) to see whether that bypasses your issue?

Batch basically runs the docker commands for you based on the information you provide in the container runnable.
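For example, the container runnable would look roughly like this (the image URI is a placeholder):

"runnables": [
    {
        "container": {
            "imageUri": "<my-docker-image>",
            "options": "--runtime nvidia --gpus all"
        }
    }
]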

When I try adding `--runtime nvidia --gpus all` then I get an error "docker: Error response from daemon: unknown or invalid runtime name: nvidia.".

If I try with just `--gpus all` then I get an error "invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead".

Here's a minimal Dockerfile and job config to reproduce the issue:

 

FROM python:3-slim-bullseye

RUN pip install nvidia-ml-py
ENTRYPOINT ["python", "-c", "import pynvml ; pynvml.nvmlInit()"]

 

 

 

{
    "taskGroups": [
        {
            "taskCount": "1",
            "parallelism": "1",
            "taskSpec": {
                "computeResource": {
                    "cpuMilli": "1000",
                    "memoryMib": "1024"
                },
                "runnables": [
                    {
                        "container": {
                            "imageUri": "<my-docker-image>",
                            "entrypoint": "",
                            "volumes": [],
                            "options": "--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl"
                        }
                    }
                ],
                "volumes": []
            }
        }
    ],
    "allocationPolicy": {
        "instances": [
            {
                "installGpuDrivers": true,
                "policy": {
                    "provisioningModel": "SPOT",
                    "machineType": "n1-standard-2",
                    "accelerators": [
                        {
                            "type": "nvidia-tesla-t4",
                            "count": "1"
                        }
                    ],
                    "bootDisk": {
                        "sizeGb": "75"
                    }
                }
            }
        ],
        "labels": {
            "user": "avishai"
        },
        "serviceAccount": {
            "email": "<service-account-email>"
        }
    },
    "logsPolicy": {
        "destination": "CLOUD_LOGGING"
    }
}

 

I submit with `gcloud beta batch jobs submit --project <project> --location us-central1 --job-prefix=my-prefix --config /path/to/config.json`

@wenyhu could you comment on whether I'm doing anything wrong, or whether this is acknowledged as a Google Batch issue? I'd like to know why I'm able to run this same container manually on GCE successfully, but not with Google Batch.

Hi @AvishaiW,

I built the docker image based on the Dockerfile you shared, and yes I see the same error as what you mentioned.

However, I also tried to run this Docker image on a VM based on the latest COS, and I see the same `docker: Error response from daemon: unknown or invalid runtime name: nvidia` error when I manually run `sudo docker run --runtime nvidia --gpus all <DOCKER_IMAGE>`. And I encounter the same error as Batch's error if I do `sudo docker run --privileged <DOCKER_IMAGE>`, which means the behavior on the GCE VM and in the Batch job is the same when I try to repro. That actually matches my expectation, since Batch mainly runs the docker command for you without magic.

Therefore, if your GCE VM case works, could you give me more information: (1) Which GCE VM with which COS version were you running? (2) What exact docker command were you running? Was it `sudo docker run --runtime nvidia --gpus all <DOCKER_IMAGE>`? (3) Are there other packages you pre-installed on your GCE VM? E.g. I at least manually installed the GPU driver on the GCE VM.

Thanks!

I realize I didn't use the COS image but the GPU-optimized Debian OS image with CUDA support (a.k.a. "Deep Learning on Linux"). My mistake. Is there a way to get Google Batch to use that image? Or alternatively, to reach the same setup as the GPU-optimized one?

This is the prompt you get when getting a shell on the machine for the first time:

======================================
Welcome to the Google Deep Learning VM
======================================

Version: common-cu118.m125
Resources:
* Google Deep Learning Platform StackOverflow: https://stackoverflow.com/questions/tagged/google-dl-platform
* Google Cloud Documentation: https://cloud.google.com/deep-learning-vm
* Google Group: https://groups.google.com/forum/#!forum/google-dl-platform

To reinstall Nvidia driver (if needed) run:
sudo /opt/deeplearning/install-driver.sh
Linux avishai-gpu-test 5.10.0-32-cloud-amd64 #1 SMP Debian 5.10.223-1 (2024-08-10) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

This VM requires Nvidia drivers to function correctly. Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n] y
Installing Nvidia driver.
....

Hi @AvishaiW,

OOC, is there any specific reason you want to use the COS image instead of the Debian image? The Batch Debian image provides almost the same GPU support as the Batch COS image. Specifying the boot disk image as `batch-debian` should already overcome the issue and make the Debian image the Batch job's default image, similar to the Deep Learning Linux image.

If you do want to use the specific Deep Learning image instead of the Batch image, you can also specify the Deep Learning image URL in the boot disk field, following https://cloud.google.com/batch/docs/specify-vm-os-image.
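For example, a sketch of the boot disk field: `batch-debian` selects the Batch Debian image, while a full image URI selects a specific public image. The family placeholder below is illustrative; replace it with the exact Deep Learning VM image or family you want (these are published under the ml-images project):

"policy": {
    "bootDisk": {
        "image": "projects/ml-images/global/images/family/<DEEP_LEARNING_IMAGE_FAMILY>",
        "sizeGb": "75"
    }
}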

Hope this helps!

This helps, thank you!

@wenyhu so what is our conclusion here? We should not use batch-cos but use batch-debian instead when creating container-only GPU jobs?

But I tried using batch-debian instead of batch-cos to run a container-only job and specified --runtime=nvidia in the options field, and I got this error in the log: docker: Error response from daemon: unknown or invalid runtime name: nvidia

But from the log it seems that the NVIDIA driver and the NVIDIA Container Toolkit were installed?

[BATCH NVIDIA Container Toolkit]: NVIDIA Container Toolkit installed

@gradientopt no need to specify any custom options when you specify the batch-debian image. It just works.

Thanks! @wenyhu I am wondering why batch-cos would not work? Ideally we would want it to work, since we are running container jobs and batch-cos should be more efficient.

That might be related to your Docker image's requirements. For most GPU container job cases, both the Batch COS image and the Debian image work. Container-Optimized OS is a Google-only image designed around a read-only filesystem, so there can be limitations.

You can refer to https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus for more information.

Hi there, 

Following all the advice here, I tried launching a Batch job on a g2 instance, using the "batch-debian" image with no options, but when running a check via `torch.cuda.is_available()` I get False.

I also tried using the deep learning image directly () but it failed to start with this error message: `Failed to reload sshd.service: Unit sshd.service not found`.

@wenyhu would you mind helping me figure out what I am doing wrong?

Hi there, 

I finally got it working with batch-debian as the boot disk and --gpus all as the only option. Note that adding the devices as mentioned here did not help.
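For anyone who lands here later, the relevant parts of my working job config look roughly like this (the image URI and machine type are placeholders for my own values; everything else is left at Batch defaults):

"taskGroups": [
    {
        "taskSpec": {
            "runnables": [
                {
                    "container": {
                        "imageUri": "<my-docker-image>",
                        "options": "--gpus all"
                    }
                }
            ]
        }
    }
],
"allocationPolicy": {
    "instances": [
        {
            "installGpuDrivers": true,
            "policy": {
                "machineType": "g2-standard-4",
                "bootDisk": {
                    "image": "batch-debian"
                }
            }
        }
    ]
}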