Expected Behavior
I want to use a GPU in a component of a Vertex AI pipeline.
Actual Behavior
Unfortunately, `torch.cuda.is_available()` returns `False`. Also, `nvidia-smi` does not work when run inside the Vertex AI container.
Note: both commands also fail locally in the container if I don't specify the `--gpus all` flag in `docker run --rm -it --gpus all ee97db5bbd98 /bin/bash`. However, I can't find any option to add the `--gpus all` flag for Vertex AI. Would this be required?
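For context, a minimal diagnostic sketch I can drop at the top of the component entrypoint to log what the container sees at runtime (it assumes only that torch is installed in the image; nothing here is Vertex-specific):

import shutil
import subprocess

import torch

# Does the CUDA runtime see any device inside the container?
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())

# nvidia-smi only exists when the NVIDIA driver is mounted into the container.
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not found: the host driver is not visible to this container")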
Steps to Reproduce the Problem
My YAML file:
name: Processing
description: Process all the found HTML
inputs:
  - name: friendly_name
    type: String
    description: The name of the company
  - name: language
    type: String
    description: The language to process
  - name: models
    type: String
    description: The models that will be used (all, genre, or standard)
implementation:
  container:
    image: eu.gcr.io/uman-interns/backend:v1.7
    command:
      [
        python,
        backend/pages/III_Process_website_data/process_website_data.py,
        --friendly_name,
        { inputValue: friendly_name },
        --language,
        { inputValue: language },
        --models,
        { inputValue: models }
      ]
My pipeline:
from kfp.v2 import compiler, dsl
import kfp.components as comp

from config import GCS_ARTIFACT_BUCKET, VARS

processing = comp.load_component_from_file("scraping/components/processing.yaml")
embeddings = comp.load_component_from_file("scraping/components/embeddings.yaml")


def compile_pipeline(file_name: str, tag: str):
    @dsl.pipeline(
        name="scraping",
        description="Scrape a site and extract meaningful topics",
        pipeline_root=f"gs://{GCS_ARTIFACT_BUCKET}/scraping/{tag}",
    )
    def pipeline(
        friendly_name: str, url: str, language: str, google: bool, models: str
    ):
        PROJECT_ID = VARS["PROJECT_ID"]
        process = (
            processing(friendly_name, language, models)
            .set_display_name("URL processing")
            .set_env_variable("PROJECT_ID", PROJECT_ID)
            .set_caching_options(enable_caching=False)
            .set_cpu_limit("4")
            .set_memory_limit("16G")
            .add_node_selector_constraint(
                "cloud.google.com/gke-accelerator", "NVIDIA_TESLA_T4"
            )
            .set_gpu_limit(1)
        ).after(crawling)
        embed = (
            embeddings(friendly_name, language, models)
            .set_display_name("Create embeddings")
            .set_env_variable("PROJECT_ID", PROJECT_ID)
            .set_caching_options(enable_caching=False)
            .set_cpu_limit("4")
            .set_memory_limit("16G")
            .add_node_selector_constraint(
                "cloud.google.com/gke-accelerator", "NVIDIA_TESLA_T4"
            )
            .set_gpu_limit(1)
        ).after(process)

    compiler.Compiler().compile(pipeline, file_name)
Visualized in the browser:
Does your container properly install the CUDA and NVIDIA drivers?
eu.gcr.io/uman-interns/backend:v1.7
If not, it's probably best to use a CUDA-ready image as the base image for your image.
For example:
FROM nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04
Hi Sascha,
I'm facing a similar issue. I'm using the PyTorch 2 CUDA image (pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime) as the base image for my TorchServe handler to build my custom container. It detects the GPU when I run it locally; however, when I add it as a component in my Vertex AI pipeline, it doesn't use the GPU during execution. How do I begin to debug and fix this problem? Please help. Thanks!
Hi,
I'm facing the same issue. My custom container works fine on a GCP instance and utilizes the GPU for training. However, the same container, when run as part of a Vertex AI custom training job, runs only on the CPU and does not use the V100 GPU. I do use an N1 instance with an NVIDIA V100, and the job info does show it. Also, there are no quota issues, and the region is North America.
Any suggestions? Thanks!
Issues with the GPUs are in almost all cases related to an improper Dockerfile.
I put together a notebook that shows two ways (primarily two different Dockerfiles) to properly utilize the GPUs with PyTorch.
For the full code, see here:
https://colab.research.google.com/drive/1leAjgyYZTrbSWZ1m_0duvlrxr1vI2h1e
Option 1 uses Google's pre-built container as the base image. I saw you are using Torch 1.11.0, which is supported by those pre-built containers. This approach is the most optimal one, as Google optimized those images to run best on their infrastructure.
Option 2, which I understood is your preferred way, uses a fully custom container, similar to the one you are using. But I optimized and simplified it a bit to reduce the chances of errors, and there are a lot of them when it comes to CUDA and cuDNN.
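Independent of which option you pick, a quick sanity check inside the image (a sketch, not part of the notebook) tells you whether the installed torch wheel was built with CUDA/cuDNN at all, or whether only the driver is missing at runtime. Run it locally with docker run --rm --gpus all <your-image> python check_cuda.py (check_cuda.py is just a hypothetical name for this snippet):

import torch

# Versions the installed torch wheel was built against (None means a CPU-only build).
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# Whether a GPU is actually visible at runtime (depends on the driver, not on the wheel).
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
else:
    print("no GPU visible at runtime")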