Hi everyone,
I am trying to launch a Vertex AI CustomJob training job in europe-west1 using a T4 GPU.
For two days now I have kept receiving an "Insufficient resources" error, and after about 50 attempts I wonder if I am the only one experiencing this issue. Has anyone managed to launch a training job on a T4?
I've checked and my quotas are ok.
Thanks for the feedback
I was able to run training on a T4. Also, per the service health dashboard, there was an incident on March 3, but it only affected us-central1, us-east1, and europe-west3, and it was resolved a day later. Running the command below in Cloud Shell also shows that the T4 is available in europe-west1-b/c/d:
gcloud compute accelerator-types list
Hi Joevanie,
Thanks for your input.
You say you managed to 'gcloud ai custom-jobs create --region=europe-west1 ...' with a T4 accelerator recently ?
I've just retried (yet again) and it keeps failing with 'Resources are insufficient in region: europe-west1. Please try a different region. If you use K80, please consider using P100 or V100 instead.'
Also, I believe
gcloud compute accelerator-types list
lists the theoretically available accelerators per region, but it does not check actual availability in real time. I could not find any place or tool that offers such a real-time availability check.
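For reference, the listing can at least be narrowed to the T4 entries of a single region. This is a minimal sketch, assuming the standard gcloud --filter and --format flags; the function name list_accelerator_zones is made up for illustration. As noted above, it only shows where the accelerator type is offered, not whether capacity is free right now.

```shell
# List the zones of one region where a given accelerator type is offered.
# This reflects the catalog only, not real-time capacity.
# list_accelerator_zones is a hypothetical helper name.
list_accelerator_zones() {
  accel="$1"    # e.g. nvidia-tesla-t4
  region="$2"   # e.g. europe-west1
  gcloud compute accelerator-types list \
    --filter="name=${accel} AND zone~${region}" \
    --format="value(zone.basename())"
}

# Example call:
#   list_accelerator_zones nvidia-tesla-t4 europe-west1
```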
I've read here and there that when a region runs short on GPUs, Google prioritizes its most valuable customers. As I am not yet spending a lot on GCP, I guess this could explain why I can't get a GPU. But as long as I can't train my models, I have no way to spend my budget on GCP either. I hope this "prioritize the good customers" policy is only a rumor.
@LawrenceAlgocat wrote:
You say you managed to 'gcloud ai custom-jobs create --region=europe-west1 ...' with a T4 accelerator recently ?
I did it using the console. Can I ask which documentation you followed? You may want to check this out. Also, have you tried choosing a different region/GPU just to test?
Hi Joevanie,
I followed the documentation you linked as well as the CLI docs.
The job works fine when run without an accelerator:
workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
  replicaCount: 1
  containerSpec:
    [...]
As soon as I request a GPU with acceleratorType and acceleratorCount, it fails with the "Insufficient resources" error:
workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
    acceleratorType: NVIDIA_TESLA_T4
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    [...]
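For anyone trying to reproduce this, a config like the one above would typically be submitted and inspected roughly as follows. This is only a sketch: the helper names, the display name "t4-test", and the config file path are assumptions for illustration, not taken from the actual setup. It relies on the --region, --display-name, --config, and --format flags of gcloud ai custom-jobs.

```shell
# Submit a custom job from a YAML config file holding workerPoolSpecs.
# submit_custom_job and the display name "t4-test" are hypothetical.
submit_custom_job() {
  region="$1"   # e.g. europe-west1
  config="$2"   # path to the YAML config (assumed filename, e.g. config.yaml)
  gcloud ai custom-jobs create \
    --region="$region" \
    --display-name="t4-test" \
    --config="$config"
}

# Print the state and any error message of a previously submitted job.
# job_id is the ID reported when the job was created.
show_job_state() {
  region="$1"
  job_id="$2"
  gcloud ai custom-jobs describe "$job_id" \
    --region="$region" \
    --format="value(state,error.message)"
}

# Example calls:
#   submit_custom_job europe-west1 config.yaml
#   show_job_state europe-west1 JOB_ID
```

Checking the job with describe can help tell whether the request was rejected outright or accepted and then failed while waiting for capacity.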
As of now, I only have quota available for "Custom model training Nvidia T4 GPUs per region" in europe-west1. All other regions are at 0 (see screenshot). I've asked to raise the same quota in all European regions, but I am still waiting for an answer from the GCP team, so I can't test elsewhere yet.