Re: GPU shortage on europe-west1

LawrenceAlgocat · 03-07-2023 09:39 AM

Hi everyone,

I am trying to launch a Vertex AI CustomJob training job in europe-west1 using a T4 GPU.

For two days now I have kept receiving an "Insufficient resources" error, and after about 50 attempts I wonder whether I am the only one experiencing this issue. Has anyone managed to launch a training job on a T4?

I've checked, and my quotas are OK.

Thanks for the feedback

Joevanie

I was able to run training on a T4. Also, per the service health dashboard, there was an issue on March 3, but it specifically affected us-central1, us-east1, and europe-west3, and it was resolved a day later. Running the command below in Cloud Shell also shows that T4 is available in europe-west1-b/c/d:

gcloud compute accelerator-types list

LawrenceAlgocat

Hi Joevanie,

Thanks for your input.
You say you managed to run 'gcloud ai custom-jobs create --region=europe-west1 ...' with a T4 accelerator recently?

I've just retried (yet again) and it keeps failing with 'Resources are insufficient in region: europe-west1. Please try a different region. If you use K80, please consider using P100 or V100 instead.'

Also, I believe

gcloud compute accelerator-types list

lists the accelerators that are theoretically available per zone, but it does not check actual availability in real time. I could not find any place or tool that offers such a real-time availability check.
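To make the distinction concrete, here is a minimal Python sketch that filters the JSON output of `gcloud compute accelerator-types list --format=json` down to the zones of one region that list a given accelerator. The function name `zones_offering` and the sample data are hypothetical, and the sketch assumes each entry carries a `name` and a `zone` field (the zone possibly given as a full resource URL). A match only means the accelerator type is in the zone's catalog; it says nothing about whether capacity is free right now.

```python
import json


def zones_offering(listing_json: str, accelerator: str, region: str) -> list[str]:
    """Return the zones of `region` whose catalog lists `accelerator`.

    `listing_json` is assumed to be the output of
    `gcloud compute accelerator-types list --format=json`.
    """
    zones = set()
    for entry in json.loads(listing_json):
        # The zone may be a bare name or a full resource URL; keep the last segment.
        zone = entry.get("zone", "").rsplit("/", 1)[-1]
        # The trailing dash keeps "europe-west1" from matching "europe-west10-a".
        if entry.get("name") == accelerator and zone.startswith(region + "-"):
            zones.add(zone)
    return sorted(zones)


# Hand-written sample for illustration only, not real gcloud output.
sample = """[
  {"name": "nvidia-tesla-t4", "zone": "https://www.googleapis.com/compute/v1/projects/my-project/zones/europe-west1-b"},
  {"name": "nvidia-tesla-t4", "zone": "europe-west1-c"},
  {"name": "nvidia-tesla-p100", "zone": "europe-west1-d"},
  {"name": "nvidia-tesla-t4", "zone": "europe-west4-a"}
]"""

print(zones_offering(sample, "nvidia-tesla-t4", "europe-west1"))
# prints ['europe-west1-b', 'europe-west1-c']
```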

I've read here and there that when a region runs short on GPUs, Google prioritizes its most valuable customers. As I am not yet spending a lot on GCP, I guess this could explain why I can't get a GPU. But as long as I can't train my models, I have no way to spend my budget on GCP either. I hope this "prioritize the good customers" story is only a rumor.

Joevanie

@LawrenceAlgocat wrote:

You say you managed to run 'gcloud ai custom-jobs create --region=europe-west1 ...' with a T4 accelerator recently?

I did it using the console. Can I ask which documentation you followed? You may want to check this out. Also, have you tried choosing a different region/GPU just to test?

LawrenceAlgocat

Hi Joevanie,

I followed the documentation you linked as well as the CLI docs.

The job works great when run without an accelerator:

workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
  replicaCount: 1
  containerSpec:
    [...]

As soon as I request a GPU with accelerator-type and accelerator-count, it fails with the "Insufficient resources" error:

workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
    acceleratorType: NVIDIA_TESLA_T4
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    [...]

As of now, I only have quota available for "Custom model training Nvidia T4 GPUs per region" in europe-west1. All other regions are at 0 (see screenshot). I've asked to raise the same quota in all European regions, but I am still waiting for an answer from the GCP team, so I can't test elsewhere yet.
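Until quota is available elsewhere, the manual retries described in this thread could be automated along the following lines. This is only a sketch under stated assumptions: `submit` is a hypothetical callable that launches the job (for example by wrapping the gcloud command above) and raises `RuntimeError` carrying the service's message on failure; a real client would raise its own exception types. Only the capacity error quoted earlier is retried, with exponential backoff between attempts.

```python
import time
from typing import Callable

# Substring of the error message quoted earlier in this thread.
STOCKOUT_MARKER = "Resources are insufficient"


def submit_with_retry(
    submit: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `submit()` until it succeeds; return the number of attempts used.

    `submit` is a hypothetical job-launching callable (an assumption of this
    sketch) that raises RuntimeError with the service message on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            submit()
            return attempt
        except RuntimeError as err:
            # Retry only the capacity error; re-raise anything else,
            # and give up once the last attempt has failed.
            if STOCKOUT_MARKER not in str(err) or attempt == max_attempts:
                raise
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Demo with a fake submitter that fails once, then succeeds (no real waiting).
_outcomes = iter([RuntimeError("Resources are insufficient in region: europe-west1."), None])


def _fake_submit() -> None:
    outcome = next(_outcomes)
    if outcome is not None:
        raise outcome


print(submit_with_retry(_fake_submit, sleep=lambda seconds: None))
# prints 2
```

Retrying does not create capacity, so this only helps if GPUs free up between attempts; the delays and attempt count above are arbitrary placeholders.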