I'm migrating our training pipeline to Vertex AI. I've gotten quite far, but the GPU is not being recognized. I've been debugging and simplifying to isolate the problem, and I've reduced it to the base issue below.
The job input specification can be seen here:
{
  "workerPoolSpecs": [
    {
      "machineSpec": {
        "machineType": "n1-standard-8",
        "acceleratorType": "NVIDIA_TESLA_T4",
        "acceleratorCount": 1
      },
      "replicaCount": "1",
      "diskSpec": {
        "bootDiskType": "pd-ssd",
        "bootDiskSizeGb": 100
      },
      "pythonPackageSpec": {
        "executorImageUri": "europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-11:latest",
        "packageUris": [
          "gs://training-data-prostate/trainer-0.1.tar.gz"
        ],
        "pythonModule": "trainer.task",
        "args": [
          "--batch-size",
          "8"
        ],
        "env": [
          {
            "name": "SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL",
            "value": "True"
          },
          {
            "name": "CUDA_VISIBLE_DEVICES",
            "value": "3"
          },
          {
            "name": "det_models",
            "value": "/gcs/training-data-prostate/output"
          },
          {
            "name": "OMP_NUM_THREADS",
            "value": "1"
          }
        ]
      }
    }
  ],
  "baseOutputDirectory": {
    "outputUriPrefix": "gs://training-data-prostate/aiplatform-custom-training-2024-12-09-15:23:47.778"
  }
}
The environment variables can be ignored.
Of note, this error has occurred consistently, regardless of the image being used. I have also tried with
The code being run in trainer.task performs the following:
import os

import torch


def run(args):
    print(f"Printing args: {args}")
    input_data_url = "/gcs/training-data/t2.nii.gz"
    with open(input_data_url, 'rb') as f:
        # Seek to the end to get the size of the file
        f.seek(0, os.SEEK_END)
        file_size = f.tell()
    print(f'Done! file size: {file_size}')
    print(torch.__version__)
    print(torch.cuda.device_count())
    print(torch.cuda.get_device_name(0))
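When debugging visibility problems like this, it can also help to log the CUDA-related environment inside the container before touching `torch.cuda`. A minimal, hypothetical helper (the name `cuda_env_report` is my own, not part of any library):

```python
import os


def cuda_env_report() -> dict:
    """Collect environment variables that affect GPU visibility.

    CUDA_VISIBLE_DEVICES is the important one: if it names a device
    index that does not exist on the machine, frameworks such as
    PyTorch will see zero GPUs.
    """
    keys = ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES", "LD_LIBRARY_PATH")
    return {k: os.environ.get(k, "<unset>") for k in keys}


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}={value}")
```

Printing this at the top of `run()` makes it immediately visible in the Vertex AI job logs whether an env var from the job spec is masking the device.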
Finally, the output logs after successful job set-up look like this:
INFO 2024-12-09T14:24:57.497426271Z [resource.labels.taskName: workerpool0-0] Running command: python3 -m trainer.task --batch-size 8
INFO 2024-12-09T14:24:58.367860316Z [resource.labels.taskName: workerpool0-0] Namespace(batch_size=8)
INFO 2024-12-09T14:24:58.368443489Z [resource.labels.taskName: workerpool0-0] Printing args: Namespace(batch_size=8)
INFO 2024-12-09T14:24:58.368552923Z [resource.labels.taskName: workerpool0-0] Done! file size: 9612380
INFO 2024-12-09T14:24:58.368625163Z [resource.labels.taskName: workerpool0-0] 1.11.0
INFO 2024-12-09T14:24:58.397288559Z [resource.labels.taskName: workerpool0-0] 0
ERROR 2024-12-09T14:24:58.398562191Z [resource.labels.taskName: workerpool0-0] Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 34, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 30, in main
    experiment.run(args)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 15, in run
    print(torch.cuda.get_device_name(0))
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 328, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 358, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
So the code itself runs: I can read the file and print the torch version, but PyTorch does not recognize any GPU devices. Note that this error differs from the one I would see locally:
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index
I had expected that the pre-built containers on Vertex AI would natively support GPU devices with PyTorch, with the proper CUDA drivers installed.
If anyone is able to help me figure out why the prebuilt container is not recognizing the attached GPU, I'd be very thankful.
Kind regards,
Guus
Hi @g_kolpa,
Welcome to Google Cloud Community!
The error message RuntimeError: No CUDA GPUs are available indicates that your PyTorch process isn't seeing the NVIDIA Tesla T4 GPU, despite it being specified in your Vertex AI job's configuration. The problem isn't a lack of drivers, but rather a mismatch between your job's environment and how Vertex AI assigns GPUs: the job sets CUDA_VISIBLE_DEVICES=3, while the single attached T4 is exposed as device 0, so the mask leaves no GPUs visible to PyTorch.
Here's a breakdown of the potential causes:
Here are some workarounds that you may try:
Here is a similar case that you may find useful as well.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
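The mechanics can be illustrated without a GPU at all. Below is a small sketch (`visible_devices` is a hypothetical helper of mine, not part of PyTorch or CUDA) that mimics how the CUDA runtime interprets CUDA_VISIBLE_DEVICES as a mask over physical device indices:

```python
from typing import Optional


def visible_devices(physical_count: int,
                    cuda_visible_devices: Optional[str]) -> list:
    """Mimic how the CUDA runtime maps CUDA_VISIBLE_DEVICES onto
    physical device indices.

    Unset means all devices are visible; otherwise only the listed
    indices that actually exist on the machine are kept, in order.
    """
    if cuda_visible_devices is None:
        return list(range(physical_count))
    visible = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if token.isdigit() and int(token) < physical_count:
            visible.append(int(token))
    return visible


# A Vertex AI worker with acceleratorCount=1 has a single physical GPU, index 0.
print(visible_devices(1, None))  # unset: the one GPU is visible
print(visible_devices(1, "3"))   # index 3 does not exist, so nothing is visible
print(visible_devices(1, "0"))   # correct masking for a single-GPU machine
```

This is why `torch.cuda.device_count()` printed 0 in your logs: with CUDA_VISIBLE_DEVICES=3 on a one-GPU machine, the runtime's visible set is empty, which is indistinguishable (to PyTorch) from having no GPUs at all.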
Thanks for the reply; indeed, the issue stemmed from CUDA_VISIBLE_DEVICES=3, which I had kept set. After removing it, I was able to start training.
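For anyone hitting the same issue: the fix amounts to dropping the CUDA_VISIBLE_DEVICES entry from the job spec's env section (the other variables can stay), e.g.:

```json
"env": [
  {
    "name": "SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL",
    "value": "True"
  },
  {
    "name": "det_models",
    "value": "/gcs/training-data-prostate/output"
  },
  {
    "name": "OMP_NUM_THREADS",
    "value": "1"
  }
]
```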
Thanks for the reply.