
Prebuilt container not recognizing GPU in Vertex AI

I'm migrating our training pipeline to Vertex AI. I've gotten quite far, but the GPU is not being recognized. I've been debugging and simplifying to see where it goes wrong, and have reduced it to the base issue described here.

The job input specification can be seen here:

{
  "workerPoolSpecs": [
    {
      "machineSpec": {
        "machineType": "n1-standard-8",
        "acceleratorType": "NVIDIA_TESLA_T4",
        "acceleratorCount": 1
      },
      "replicaCount": "1",
      "diskSpec": {
        "bootDiskType": "pd-ssd",
        "bootDiskSizeGb": 100
      },
      "pythonPackageSpec": {
        "executorImageUri": "europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-11:latest",
        "packageUris": [
          "gs://training-data-prostate/trainer-0.1.tar.gz"
        ],
        "pythonModule": "trainer.task",
        "args": [
          "--batch-size",
          "8"
        ],
        "env": [
          {
            "name": "SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL",
            "value": "True"
          },
          {
            "name": "CUDA_VISIBLE_DEVICES",
            "value": "3"
          },
          {
            "name": "det_models",
            "value": "/gcs/training-data-prostate/output"
          },
          {
            "name": "OMP_NUM_THREADS",
            "value": "1"
          }
        ]
      }
    }
  ],
  "baseOutputDirectory": {
    "outputUriPrefix": "gs://training-data-prostate/aiplatform-custom-training-2024-12-09-15:23:47.778"
  }
}

 The environment variables can be ignored.
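
For reference, submitting the same spec through the Vertex AI Python SDK would look roughly like the sketch below. The project ID and region are placeholders; the bucket, image, and package URI are the ones from the spec above, and the env section is omitted.

from google.cloud import aiplatform

aiplatform.init(
    project="my-project",       # placeholder project ID
    location="europe-west4",    # placeholder region
    staging_bucket="gs://training-data-prostate",
)

job = aiplatform.CustomJob(
    display_name="trainer-gpu-debug",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "disk_spec": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 100},
        "python_package_spec": {
            "executor_image_uri": "europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-11:latest",
            "package_uris": ["gs://training-data-prostate/trainer-0.1.tar.gz"],
            "python_module": "trainer.task",
            "args": ["--batch-size", "8"],
        },
    }],
    base_output_dir="gs://training-data-prostate/output",  # any prefix under the bucket
)
job.run()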

Of note, this error occurs regardless of the image being used. I have also tried with:

  • europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13.py310:latest
  • europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-3.py310:latest

The code being run in trainer.task performs the following:

import os
import torch

def run(args):
    print(f"Printing args: {args}")
    input_data_url = "/gcs/training-data/t2.nii.gz"
    with open(input_data_url, 'rb') as f:
        f.seek(0, os.SEEK_END)
        # Get the size of the file
        file_size = f.tell()
    print(f'Done! file size: {file_size}')
    print(torch.__version__)
    print(torch.cuda.device_count())
    print(torch.cuda.get_device_name(0))
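
For completeness, the trainer-0.1.tar.gz referenced in packageUris is a standard source distribution. The layout below is inferred from the module names in the spec and the traceback further down (trainer/task.py and trainer/experiment.py); the setup.py is a minimal, trimmed sketch rather than the exact file.

# Assumed package layout:
#   setup.py
#   trainer/
#       __init__.py
#       task.py         # parses --batch-size and calls experiment.run(args)
#       experiment.py   # contains the run() function shown above
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=[],  # extra dependencies not already in the prebuilt image
)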

Finally, the output logs after successful job set-up look like this:

INFO 2024-12-09T14:24:57.497426271Z [resource.labels.taskName: workerpool0-0] Running command: python3 -m trainer.task --batch-size 8
INFO 2024-12-09T14:24:58.367860316Z [resource.labels.taskName: workerpool0-0] Namespace(batch_size=8)
INFO 2024-12-09T14:24:58.368443489Z [resource.labels.taskName: workerpool0-0] Printing args: Namespace(batch_size=8)
INFO 2024-12-09T14:24:58.368552923Z [resource.labels.taskName: workerpool0-0] Done! file size: 9612380
INFO 2024-12-09T14:24:58.368625163Z [resource.labels.taskName: workerpool0-0] 1.11.0
INFO 2024-12-09T14:24:58.397288559Z [resource.labels.taskName: workerpool0-0] 0
ERROR 2024-12-09T14:24:58.398562191Z [resource.labels.taskName: workerpool0-0] Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 34, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 30, in main
    experiment.run(args)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 15, in run
    print(torch.cuda.get_device_name(0))
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 328, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 358, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

So the code executes properly: I can read the file and print the torch version, but it does not recognize any GPU devices. Note that this error differs from the one I would get locally without an NVIDIA driver:

    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index

I had expected that the pre-built containers on Vertex AI, which come with the proper CUDA drivers installed, would be able to use GPU devices natively with PyTorch.
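
One extra check I can add to rule out a missing driver inside the container is to shell out to nvidia-smi from the task itself; a small sketch, assuming the tool is on the PATH in the prebuilt image:

import subprocess

def print_nvidia_smi():
    # nvidia-smi reports the driver and attached GPUs independently of
    # CUDA_VISIBLE_DEVICES, so it distinguishes "no driver" from "no visible device".
    try:
        print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
    except FileNotFoundError:
        print("nvidia-smi not found in this image")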

If anyone is able to help me figure out why the prebuilt container is not recognizing the attached GPU, I'd be very thankful.

Kind regards,

Guus

1 ACCEPTED SOLUTION

Hi @g_kolpa,

Welcome to Google Cloud Community!

The error message RuntimeError: No CUDA GPUs are available clearly indicates that your PyTorch process isn't seeing the NVIDIA Tesla T4 GPU, despite specifying it in your Vertex AI job's configuration. The problem isn't a lack of drivers, but rather a mismatch between your code and how Vertex AI handles GPU assignment and environment setup.

Here's a breakdown of the potential causes:

  • Incorrect CUDA_VISIBLE_DEVICES: You're setting CUDA_VISIBLE_DEVICES=3. This variable tells CUDA which GPUs to expose to your process, and indices start at 0, so the first GPU is 0, the second is 1, and so on. In your container there is only one attached GPU, so there is no device at index 3 and CUDA sees no GPUs at all (see the short illustration after this list).
  • Incorrect GPU assignment in the container: Even though you requested a GPU, the index it is mapped to inside the container may not match what your CUDA_VISIBLE_DEVICES value assumes.
  • Environment variable precedence: The CUDA_VISIBLE_DEVICES setting in your env section might be overridden by other environment variables or configuration inside the container.
  • Container image incompatibility: Your container image (e.g., europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-11:latest) might manage GPUs differently than expected; the exact issue could involve driver versions, system setup, or the way libraries are loaded.
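
A minimal illustration of the first point, assuming a machine with a single attached GPU (which is therefore index 0):

import os

# Hide the only GPU by pointing CUDA at an index that does not exist.
# The variable must be set before the first CUDA call; torch initializes CUDA lazily.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

import torch

print(torch.cuda.is_available())   # False
print(torch.cuda.device_count())   # 0
# torch.cuda.get_device_name(0) now raises: RuntimeError: No CUDA GPUs are available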

Here are some workarounds that you may try:

  1. Verify GPU assignment: This is paramount. Inside your trainer.task script, before calling torch.cuda.get_device_name(0), print the value of os.environ.get('CUDA_VISIBLE_DEVICES'). This shows which device index the container actually exposes; crucially, the correct index is almost certainly 0, not 3.
  2. Remove CUDA_VISIBLE_DEVICES: If the container sets up GPU access correctly, setting CUDA_VISIBLE_DEVICES is usually unnecessary; Vertex AI's environment handles the mapping automatically. Removing this environment variable and trying again can resolve the issue. If other libraries require the variable, set it to the index the container actually exposes (most likely 0).
  3. Simplify the run function: Reduce the function to only the calls you need to test, such as torch.cuda.is_available() and torch.cuda.device_count(), and verify those first.
  4. Use torch.cuda.is_available(): Instead of immediately asking for the GPU name, first verify that a GPU is available; this prevents the error from occurring in the first place (a combined sketch of points 1, 3, and 4 follows this list).
  5. Examine your container logs: If the problem persists, review the complete logs of your Vertex AI job execution and look for errors related to GPU initialization or library loading that might provide additional clues.
  6. Inspect the executorImageUri: Ensure the container image you're using has been tested on Vertex AI and is known to work with the chosen machine type and accelerator.
Here is a similar case that you may find useful as well.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.


2 REPLIES


Thanks for the reply. Indeed, the issue stemmed from CUDA_VISIBLE_DEVICES=3, which I had kept from the old setup. After removing it, I was able to start up the training.