Vertex AI fine tuning LLM model

davmol94ISP · 07-13-2023 08:14 AM

Hi,

I'm new to GCP and I was trying to run an LLM tuning process in Vertex AI.

I uploaded my data in JSONL format to a bucket and selected it to start the tuning process. During the pipeline run, I got this error:

com.google.cloud.ai.platform.common.errors.AiPlatformException: code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod, cause=null; Failed to create custom job.Project number: 162269030045, Job id: 8911904581961646080, Task id: -5724177380969283584, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/162269030045/locations/europe-west4/metadataStores/default/executions/9317118956493817351; Failed to create external task or refresh its state. Task:Project number: 162269030045, Job id: 8911904581961646080, Task id: -5724177380969283584, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/162269030045/locations/europe-west4/metadataStores/default/executions/9317118956493817351; Failed to handle the pipeline task. Task: Project number: 162269030045, Job id: 8911904581961646080, Task id: -5724177380969283584, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/162269030045/locations/europe-west4/metadataStores/default/executions/9317118956493817351

So I looked online for a solution or workaround to this problem. I found that some users resolved it by updating their quotas. From the error message, it looks like my limit has been reached for europe-west4 (if I've understood correctly), so updating the quota is what I'm trying to do right now. Has anyone else run into the same error, and can you give me some advice on how to fix it?
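
One way to double-check that reading of the error is to pull the status code, the quota metric, and the location out of the message. The snippet below is only a sketch: it works on a shortened copy of the error text above, and the function name `summarize_quota_error` and its regular expressions are assumptions about the message layout, not an official parser.

```python
import re

# Shortened copy of the error text from this post (only the parts being parsed).
ERROR_TEXT = (
    "com.google.cloud.ai.platform.common.errors.AiPlatformException: "
    "code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: "
    "aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod, cause=null; "
    "Failed to create custom job.Project number: 162269030045, "
    "Execution name: projects/162269030045/locations/europe-west4/"
    "metadataStores/default/executions/9317118956493817351"
)


def summarize_quota_error(text):
    """Return the status code, quota metrics, and locations named in the error.

    Hypothetical helper: the patterns are guesses based on this one message.
    """
    code = re.search(r"code=([A-Z_]+)", text)
    metrics = re.search(r"exceed quota limits: (.*?), cause=", text)
    # The location appears inside the resource path of the execution name.
    locations = sorted(set(re.findall(r"/locations/([a-z0-9-]+)/", text)))
    return {
        "code": code.group(1) if code else None,
        "metrics": [m.strip() for m in metrics.group(1).split(",")] if metrics else [],
        "locations": locations,
    }


summary = summarize_quota_error(ERROR_TEXT)
print(summary)
```

If the parsed location is europe-west4 and the metric is the restricted TPU v3 pod one, then the quota to look at is the one for that metric in that region.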

I look forward to hearing back from you!

Thank you so much!

kvandres

Good day @davmol94ISP,

Welcome to Google Cloud Community!

Note that when you create a model tuning job, you need to make sure that you have enough quota for the tuning location. Tuning jobs in europe-west4 use 64 cores of the TPU v3 Pod. See https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-models#create_a_model_tuning_job
To check your current quota in europe-west4, go to APIs & Services > Vertex AI API > Quotas.
Add this to your filter: region:europe-west4 Quota:Restricted image training TPU V3 pod cores per region
This filters the list down to your current limit for tuning jobs in europe-west4. If you don't have enough quota, you need to file a request to increase your quota for Restricted image training TPU V3 pod cores per region in europe-west4, in multiples of 64. The same applies if you need to run multiple concurrent tuning jobs in your project. You can visit this link to learn more: https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-models#quota
Here is a step-by-step process for requesting a higher quota limit: https://cloud.google.com/docs/quota_detail/view_manage#requesting_higher_quota
Please note that your request for a quota increase is subject to approval. Cloud Customer Care will process your request in around 2 to 3 days and will send you an email if your quota increase is approved.
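
To make the "multiples of 64" arithmetic concrete, here is a minimal sketch of how you might work out the quota value to ask for. The 64-core figure comes from the note above. The function names `target_quota` and `needs_increase` are hypothetical helpers for illustration, not part of any Google Cloud library, and the current limit is something you read off the Quotas page yourself.

```python
# From the note above: one tuning job in europe-west4 uses 64 TPU v3 Pod cores.
CORES_PER_TUNING_JOB = 64


def target_quota(concurrent_jobs):
    """Quota (in cores) needed to run this many tuning jobs at the same time."""
    if concurrent_jobs < 1:
        raise ValueError("concurrent_jobs must be at least 1")
    return concurrent_jobs * CORES_PER_TUNING_JOB


def needs_increase(current_limit, concurrent_jobs):
    """True if the current regional limit is too low for the planned jobs."""
    return current_limit < target_quota(concurrent_jobs)


# Example: a project with a limit of 0 that wants to run one tuning job.
print(target_quota(1))        # cores to request for a single job
print(needs_increase(0, 1))   # whether a quota request is needed
print(target_quota(3))        # cores to request for three concurrent jobs
```

So a project that only needs one tuning job at a time would request a limit of 64, and one that wants three concurrent jobs would request 192.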

Hope this helps!