
PyTorch uses the GPU in a container on my local machine, but is unable to use the GPU on Vertex AI

Expected Behavior
I want to use a GPU in a component of a Vertex AI pipeline.

Actual Behavior
Unfortunately, `torch.cuda.is_available()` returns `False`. Also, `nvidia-smi` does not work when run inside the container on Vertex AI.
Note: both commands also fail locally in the container if I don't specify the `--gpus all` flag in the command `docker run --rm -it --gpus all ee97db5bbd98 /bin/bash`. However, I can't find any option to add the `--gpus all` flag for Vertex AI. Would this be required?
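
For reference, a minimal check along these lines at the top of the component script shows what the Vertex AI task actually reports; this is only a sketch, not the exact script from the repo:

# Sketch of a startup diagnostic for the component script (illustrative only).
import subprocess

import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible CUDA devices:", torch.cuda.device_count())

# nvidia-smi only works if the NVIDIA driver is exposed to the container
# (locally that is what `--gpus all` does; on Vertex AI the platform attaches
# the driver when an accelerator is assigned to the task).
try:
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
except FileNotFoundError:
    print("nvidia-smi binary not found in the container")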

Steps to Reproduce the Problem

My YAML file:

name: Processing
description: Process all the found HTML

inputs:
  - name: friendly_name
    type: String
    description: The name of the company
  - name: language
    type: String
    description: The language to process
  - name: models
    type: String
    description: The models that will be used (all, genre, or standard)
implementation:
  container:
    image: eu.gcr.io/uman-interns/backend:v1.7
    command:
      [
        python,
        backend/pages/III_Process_website_data/process_website_data.py,
        --friendly_name,
        { inputValue: friendly_name },
        --language,
        { inputValue: language },
        --models,
        { inputValue: models }
      ]

 

My pipeline:

from kfp.v2 import compiler, dsl
import kfp.components as comp

from config import GCS_ARTIFACT_BUCKET, VARS

processing = comp.load_component_from_file("scraping/components/processing.yaml")
embeddings = comp.load_component_from_file("scraping/components/embeddings.yaml")


def compile_pipeline(file_name: str, tag: str):
    @dsl.pipeline(
        name="scraping",
        description="Scrape a site and extract meaningful topics",
        pipeline_root=f"gs://{GCS_ARTIFACT_BUCKET}/scraping/{tag}",
    )
    def pipeline(
        friendly_name: str, url: str, language: str, google: bool, models: str
    ):
        PROJECT_ID = VARS["PROJECT_ID"]

        process = (
            processing(friendly_name, language, models)
            .set_display_name("URL processing")
            .set_env_variable("PROJECT_ID", PROJECT_ID)
            .set_caching_options(enable_caching=False)
            .set_cpu_limit("4")
            .set_memory_limit("16G")
            .add_node_selector_constraint(
                "cloud.google.com/gke-accelerator", "NVIDIA_TESLA_T4"
            )
            .set_gpu_limit(1)
        ).after(crawling)  # `crawling` is defined in a part of the pipeline not shown here
        embed = (
            embeddings(friendly_name, language, models)
            .set_display_name("Create embeddings")
            .set_env_variable("PROJECT_ID", PROJECT_ID)
            .set_caching_options(enable_caching=False)
            .set_cpu_limit("4")
            .set_memory_limit("16G")
            .add_node_selector_constraint(
                "cloud.google.com/gke-accelerator", "NVIDIA_TESLA_T4"
            )
            .set_gpu_limit(1)
        ).after(process)

    compiler.Compiler().compile(pipeline, file_name)
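
For completeness, the compiled JSON can then be submitted as a Vertex AI pipeline job along these lines; the project, region, and parameter values below are placeholders, not the exact ones used here:

# Sketch of how the compiled pipeline could be submitted (placeholder values).
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west1")  # placeholders

compile_pipeline("scraping_pipeline.json", tag="v1.7")

job = aiplatform.PipelineJob(
    display_name="scraping",
    template_path="scraping_pipeline.json",
    parameter_values={
        "friendly_name": "example-company",  # placeholder values
        "url": "https://example.com",
        "language": "en",
        "google": False,
        "models": "all",
    },
    enable_caching=False,
)
job.run()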


Visualized in the browser:

[Screenshot: the pipeline graph visualized in the browser]

4 REPLIES

Does your container properly install CUDA and the NVIDIA drivers?

eu.gcr.io/uman-interns/backend:v1.7

If not, it's probably best to use a CUDA-ready image as the base image for your own image.
For example:

FROM nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04

Hi sascha,

I'm facing a similar issue. I'm using the PyTorch 2 CUDA image (pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime) as the base image for my TorchServe handler to create my custom container. It detects the GPU when I run it locally, but when I add it to a component in my Vertex AI pipeline, it doesn't use the GPU during execution. How do I begin to debug and fix this problem? Please help, thanks!

Hi, 
I'm facing the same issue. My custom container works fine on a GCP instance and utilizes the GPU for training. However, the same container, when run as part of a Vertex AI custom training job, runs only on the CPU and does not use the V100 GPU. I do use an N1 instance with an NVIDIA V100, and the job info does show it. Also, there are no quota issues and the region is North America.

Any suggestions?  Thanks!
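
For reference, the job is configured roughly like this; the display name, image URI, project, and region below are placeholders rather than my actual values:

# Sketch of the custom training job setup (placeholder names and values).
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

job = aiplatform.CustomContainerTrainingJob(
    display_name="gpu-training",                        # placeholder
    container_uri="gcr.io/my-project/trainer:latest",   # placeholder image
)

job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_V100",
    accelerator_count=1,
)

Even with the accelerator attached this way, I understand the container itself still needs the CUDA runtime libraries and a CUDA-enabled PyTorch build.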

Issues with GPUs are in almost all cases related to an improper Dockerfile.

I put together a notebook that shows two ways (primarily two different Dockerfiles) to properly utilize GPUs with PyTorch.
 
For the full code, see here:
https://colab.research.google.com/drive/1leAjgyYZTrbSWZ1m_0duvlrxr1vI2h1e
 
Option 1 is using Google's pre-built container as the base image. I saw you are using Torch 1.11.0, which is supported by those pre-built containers. This is the optimal approach, as Google optimized those images to run best on their infrastructure.
 
Option 2, which I understood is your preferred way, is using a fully custom container, similar to the one that you are using. But I optimized and simplified it a bit to reduce the chances of errors, and there are a lot of those when it comes to CUDA and cuDNN.
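
As a quick sanity check for either option, something along these lines run inside the built image tells you whether the problem is a CPU-only PyTorch wheel or a GPU that was never attached at runtime; this is only a sketch, not part of the notebook above:

# Sketch: distinguish a CPU-only PyTorch wheel baked into the image from a
# GPU/driver that is not attached at runtime.
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)             # None => CPU-only wheel
print("cuDNN version:", torch.backends.cudnn.version())   # None on CPU-only builds
print("CUDA available:", torch.cuda.is_available())

if torch.version.cuda is None:
    print("This image ships a CPU-only PyTorch build; fix the Dockerfile/base image.")
elif not torch.cuda.is_available():
    print("PyTorch is CUDA-enabled, but no GPU or driver is visible; "
          "check the accelerator settings on the Vertex AI task.")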