
Vertex AI training

Hello everyone, I am trying to fine-tune a Llama 3 model, but I got an error saying that the region I am using doesn't have the "Custom model training Nvidia A100 GPUs per region" quota (or L4). I searched for a region that has one of them, but I noticed there is no region with this quota, which I find strange because I can't train on Vertex AI without it. Is it just that no GPUs are currently available, or is there in general no region with this quota? If so, what can I do?


Hi @hadi-ibra,

Welcome to Google Cloud Community!

The error "region doesn't have the custom model training Nvidia A100 GPU per region or L4" means the region you're trying to use for training doesn't have enough A100 or L4 GPUs available for custom model training. This is a common issue when demand for high-performance GPUs is high, or when a region is experiencing a stockout.

In addition, Vertex AI, like many cloud services, has quotas for resource usage, including GPU usage. These quotas are designed to ensure fairness and prevent resource exhaustion.

Here are the workarounds that you may try:

Consider verifying your quota. Check your Google Cloud project quotas in the Google Cloud Console. Go to IAM & Admin > Quotas & System Limits and look for the relevant quota for custom model training GPU usage. See sample image below for reference:

Quotas and System Limits.png

Or you can directly click “MANAGE QUOTAS” when you are deploying a Llama3 model. See image below for reference:

Llama3 deploy model page.png

You may also query your quota programmatically through the Google Cloud APIs; a minimal sketch is shown below.
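This sketch (my own, not from the official docs) calls the Service Usage API's v1beta1 consumerQuotaMetrics endpoint to list the Vertex AI custom model training GPU quotas for your project. It assumes the google-api-python-client package and Application Default Credentials; PROJECT_ID is a placeholder:

```python
# Hedged sketch: list Vertex AI consumer quota metrics via the Service Usage
# API (v1beta1). PROJECT_ID is a placeholder; credentials come from ADC.
from googleapiclient import discovery

PROJECT_ID = "your-project-id"  # replace with your own project

service = discovery.build("serviceusage", "v1beta1")
parent = f"projects/{PROJECT_ID}/services/aiplatform.googleapis.com"

response = (
    service.services()
    .consumerQuotaMetrics()
    .list(parent=parent, pageSize=100)
    .execute()
)

for metric in response.get("metrics", []):
    # Keep only the custom model training GPU quotas mentioned in the error.
    if "custom_model_training" in metric.get("metric", ""):
        print(metric["metric"])
        for limit in metric.get("consumerQuotaLimits", []):
            for bucket in limit.get("quotaBuckets", []):
                # Buckets are keyed by dimensions such as {"region": "..."}.
                print("  ", bucket.get("dimensions", {}),
                      "->", bucket.get("effectiveLimit"))
```

If the effective limit for a region prints as 0, the quota increase flow described below is the way to raise it.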

If you're sure your project requires more GPU resources, submit a quota increase request by clicking Edit quota, as shown below, or request one through Google Cloud Customer Care. Explain your use case, the model size, and why you need more resources. They may be able to adjust your quotas.

edit quota.png

You may also consider selecting an alternative region. It's possible that a different region has more available GPUs, though this isn't guaranteed. Check the Google Cloud documentation for region-specific GPU availability. A sketch of pointing a custom training job at a different region is shown below.
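In the Vertex AI SDK, the training region is simply the location you pass when initializing, so retrying in another region is a small change. A minimal sketch, assuming the google-cloud-aiplatform package; the project, bucket, and container image below are placeholders:

```python
# Hedged sketch: run a custom container training job on L4 GPUs in a region
# where you have quota. All names below are placeholders, not real resources.
from google.cloud import aiplatform

aiplatform.init(
    project="your-project-id",           # placeholder
    location="us-central1",              # pick a region where your quota > 0
    staging_bucket="gs://your-bucket",   # placeholder
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="llama3-finetune",
    container_uri="us-docker.pkg.dev/your-project/repo/train:latest",  # placeholder
)

# g2-standard-24 machines come with 2 NVIDIA L4 GPUs attached; the
# accelerator count must match what the machine type supports.
job.run(
    replica_count=1,
    machine_type="g2-standard-24",
    accelerator_type="NVIDIA_L4",
    accelerator_count=2,
)
```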

Another recommendation is to explore smaller models or different architectures that may be less computationally intensive.

You may also consider other Vertex AI options. If your project isn't time-sensitive, Vertex AI's managed model training options often have more relaxed GPU quotas and can be easier to manage.

Lastly, if your project involves inference on a large dataset, you can consider using Vertex AI's Batch Prediction service to process data in a more resource-efficient way.
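If batch inference is the route you take, here is a minimal sketch with the google-cloud-aiplatform SDK. The model resource name, bucket paths, and the machine/accelerator choice are placeholders for illustration, not a tuned recommendation for Llama 3:

```python
# Hedged sketch: run a Vertex AI batch prediction job against an already
# uploaded model. All resource names and paths below are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Placeholder resource name of a model previously uploaded to Vertex AI.
model = aiplatform.Model("projects/your-project/locations/us-central1/models/123")

batch_job = model.batch_predict(
    job_display_name="llama3-batch-inference",
    gcs_source="gs://your-bucket/input.jsonl",
    gcs_destination_prefix="gs://your-bucket/output/",
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    sync=True,
)

print(batch_job.output_info)  # where the predictions were written
```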

By understanding the error, verifying your quota, and exploring alternatives, you can increase the chances of successfully training your model with Vertex AI. Remember that Google Cloud is constantly updating its resources, so it's a good practice to periodically check for updates.

I hope the above information is helpful.

I did as instructed, used the region asia-east2, and checked the required quota, but it returned the error below when trying to fine-tune the Llama 3 model:


ValueError: Quota not enough for custom_model_training_nvidia_l4_gpus in asia-east2: 0 < 4. Either use a different region or request additional quota. Follow instructions here https://cloud.google.com/docs/quotas/view-manage#requesting_higher_quota to check quota in a region or request additional quota for your project.

I also tried to request additional quota, since the value is 0, but they replied:
"We cannot provide you with the requested resource as Custom model serving Nvidia A100 80GB GPUs is not yet available in asia-east2."
I tried this for several regions. Which region has Custom model serving Nvidia A100 80GB GPUs or L4 GPUs available, so that I can request additional quota for it?