Hi everyone,
We’ve recently migrated to the Gemini 2.0 Flash model on Vertex AI and are running into persistent DSQ (429 RESOURCE EXHAUSTED) errors. Here’s an overview of our current setup and the challenges we’re seeing:
Request Method: Direct REST calls to the endpoint
https://aiplatform.googleapis.com/v1/projects/{google_project}/locations/{google_location}/publisher...
Location Parameter: Set to global
Authentication: Using a service account, linked to a properly configured Vertex AI project with billing enabled.
Retries: Backoff and retry logic implemented on most services.
Error Rates: 10-15% 429 errors (RESOURCE EXHAUSTED) over 24-hour periods; spikes up to 45% during peak demand.
Quota Dashboard: The Cloud Console shows usage only in europe-west4, despite requests being sent to global.
Are these error rates (especially the peaks) normal for Gemini 2.0 Flash under heavy load?
Would switching to the GenAI SDK improve reliability or lower 429 errors compared to direct REST calls?
Does setting global as the location automatically route requests to other available regions, or is it limited to one?
Is there any way to programmatically select a region based on available capacity or to load-balance across regions?
In the “Online prediction requests per minute per region” quota, what exactly do the “Value”, “Current usage percentage”, and “Current usage” fields mean in practical terms?
Any guidance or best practices would be greatly appreciated! Has anyone faced similar issues and found effective solutions?
Thanks in advance for your help.
Hi @roshan-poudel,
Welcome to Google Cloud Community!
Please see my answers inline with your questions below:
Are these error rates (especially the peaks) normal for Gemini 2.0 Flash under heavy load?
Dynamic shared quota (DSQ) provides access to a large, shared pool of resources that are dynamically allocated across all customers based on demand for a specific model (Gemini 2.0 Flash). With DSQ, a 429 error indicates that the overall pool of shared resources has been exhausted due to high demand from many users simultaneously for that particular model. It can also occur with asynchronous calls to the model that involve large multimodal inputs. To better understand the 429 error in the context of Dynamic shared quota (DSQ), you can refer to this documentation.
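Because a DSQ 429 reflects temporary contention in the shared pool, the usual client-side mitigation is to retry with exponential backoff and jitter. Below is a minimal sketch of that pattern, not an official client: `ResourceExhausted`, `backoff_delays`, and `call_with_backoff` are hypothetical names, and `send` stands for whatever function issues your REST call and raises `ResourceExhausted` when the response status is 429. The delay values are illustrative defaults, not recommended settings.

```python
import random
import time


class ResourceExhausted(Exception):
    """Hypothetical stand-in for an HTTP 429 RESOURCE_EXHAUSTED response."""


def backoff_delays(max_attempts, base=1.0, cap=32.0, rng=random.random):
    """Yield one 'full jitter' delay per retry: rng() * min(cap, base * 2**n)."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(send, max_attempts=5, base=1.0, cap=32.0, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts calls have been made.

    Only ResourceExhausted is retried; any other exception propagates at once.
    """
    delays = backoff_delays(max_attempts, base, cap)
    while True:
        try:
            return send()
        except ResourceExhausted:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted; surface the 429 to the caller
            sleep(delay)
```

The jitter spreads retries from many clients over time so they do not all hit the shared pool again at the same instant.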
Would switching to the GenAI SDK improve reliability or lower 429 errors compared to direct REST calls?
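A 429 error indicates that the resource has been exhausted on the service side, so the client library you use does not change how much capacity is available. Switching to the GenAI SDK will therefore not necessarily resolve the 429 errors, but it may provide better error handling and observability.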
Does setting global as the location automatically route requests to other available regions, or is it limited to one? Is there any way to programmatically select a region based on available capacity or to load-balance across regions?
The global endpoint covers the entire world and routes requests internally, with automatic load balancing across regions. You have no direct control over or visibility into that routing, but it provides higher availability and reduces 429 errors. However, the global endpoint is not advisable if you have ML processing or data residency requirements. Alternatively, you can manually configure requests to target specific regional endpoints. As a best practice for addressing 429 errors, though, use the global endpoint instead of a regional endpoint whenever possible.
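If you do need to switch between the global endpoint and a specific regional one, the difference on the REST side is the hostname and the location segment of the path. Here is a small sketch; `vertex_api_base` is a hypothetical helper, and it assumes regional endpoints use a `{location}-aiplatform.googleapis.com` host while `global` uses the unprefixed host from your post. It only builds the URL up to the location; the rest of the path (the publisher and model part) is left for you to append as in your existing calls.

```python
def vertex_api_base(project, location):
    """Return the REST base URL for a project and location.

    Assumption: 'global' uses the unprefixed host, and every other
    location uses a '{location}-' prefix on the same host.
    """
    host = "aiplatform.googleapis.com"
    if location != "global":
        host = f"{location}-{host}"
    return f"https://{host}/v1/projects/{project}/locations/{location}"


# Example with placeholder values:
print(vertex_api_base("my-project", "global"))
print(vertex_api_base("my-project", "europe-west4"))
```

Keeping the location in one place like this makes it easy to pin a region when data residency requires it and to fall back to `global` otherwise.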
In the “Online prediction requests per minute per region” quota, what exactly do the “Value”, “Current usage percentage”, and “Current usage” fields mean in practical terms?
Under the "Online prediction requests per minute per region" quota and system limits, "Value" represents the maximum allowable quota/limit, "Current usage" reflects the actual number of requests made, and "Current usage percentage" indicates the percentage of your usage relative to the maximum limit. However, please note that under the Dynamic shared quota (DSQ), your usage has no predefined quota limits, which eliminates the need to manage quotas or submit quota increase requests.
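To make the relationship between the three fields concrete, the percentage is simply the current usage divided by the limit. The snippet below is only an illustration of that arithmetic; `usage_percentage` is a hypothetical helper and the numbers are made up, not taken from your project.

```python
def usage_percentage(current_usage, value):
    """Return current usage as a percentage of the quota limit ('Value')."""
    if value <= 0:
        raise ValueError("quota value must be positive")
    return 100.0 * current_usage / value


# Illustrative numbers only: 45 requests in the window against a limit of 60.
print(usage_percentage(45, 60))
```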
In addition, since you've already exhausted efforts to resolve the 429 error, including exploring global endpoints, retry strategies, and Dynamic shared quota (DSQ), you may want to consider using Provisioned Throughput for a more consistent level of service. This option provides reserved dedicated capacity to avoid resource contention or queuing.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.