
Why is Google Cloud's Generative Model quota not respected by the API?

I am using a fine-tuned model of gemini-1.5-flash to process large amounts of data (worth noting the same issue happens with the base model alone too). I have increased my quota for "Generate content requests per minute per project per base model per minute per region per base_model" for the appropriate base model and region.

Strangely, my usage no longer appears on the "Quotas" page in Google Cloud, and my calls to the generate_content endpoint still seem to be restricted to the base rate limit of 5/min/region, even after the quota increases.

The specific error I am getting now is "429 Online prediction request quota exceeded for gemini-1.5-flash. Please try again later with backoff." - which is different from the errors I got before the quota increase, which referenced the specific quota I was limited by.

I would use the batch processing functionality, but it appears fine-tuned Gemini models do not support batch processing.

I attempted increasing quotas to avoid the rate-limit errors; the result was unclear rate-limit errors and a broken Quotas page.


Hi @camelmnts,

Welcome to Google Cloud Community!

According to this Nov 8, 2024 release note:

Batch prediction is available for Gemini in General Availability (GA). Available Gemini models include Gemini 1.0 Pro, Gemini 1.5 Pro, and Gemini 1.5 Flash. To get started with batch prediction, see Get batch predictions for Gemini.

Alternatively, you can also refer to this documentation for more information on batch processing and Gemini 1.5 Flash limitations. 

With regard to Error code 429, if the number of your requests exceeds the capacity allocated to process requests, then error code 429 is returned. You may check this page for guidance on how to rectify this issue. 

An additional key point to remember:

Quota is enforced on the number of concurrent tuning jobs. Every project comes with a default quota to run at least one tuning job. This is a global quota, shared across all available regions and supported models. If you want to run more jobs concurrently, you need to request additional quota for Global concurrent tuning jobs.

Furthermore, you may try these workarounds that may help you possibly resolve your concern:

  1. Verify Quota Application and Propagation:
  • Time Lag: It might take some time (potentially hours, though unlikely to be days) for quota changes to fully propagate across Google Cloud's infrastructure. Wait a significant period (e.g., 4-6 hours) before further testing.
  • Quota Project ID: Double and triple-check that you've applied the quota increase to the correct Google Cloud project ID that your code is using to access the Gemini model. A simple typo can cause this.
  • Specific Quota Name: The quota name you're adjusting might not be the one your requests are hitting. Carefully examine the error message and documentation for the precise quota name Google is referencing. There might be more granular quotas at play than you initially realized. Look for quotas related to predictions, tokens processed, or inference requests—not just the generic "Generate content" quota.
  • Billing Account: Ensure your project is correctly linked to a billing account with sufficient funds. Quota increases sometimes require billing account validation.
  2. Investigate the "429" Error Further:
  • Request Rate Monitoring: Implement more rigorous logging and monitoring of your API requests. Record timestamps, request payloads (sanitized if necessary), and response codes. This detailed logging will help pinpoint the exact moment and conditions under which the 429 error occurs. 
  • Error Details: Carefully examine the entire error response from the generate_content endpoint. There might be additional details in the response body beyond the main error message that provide clues about the specific quota or resource limit being exceeded. (Consider adding error handling to gracefully capture the full response and log it.)
  • Backoff Strategy: Your code should use exponential backoff; a fixed wait between retries is insufficient. If you get a 429, wait, then try again. If you get another 429, wait longer (e.g., double the time), and so on.
  3. Contact Google Cloud Support:

Google Cloud Support can investigate your account's quota settings and usage patterns to identify the root cause. Provide them with:

  • Your project ID
  • Detailed logs showing request timestamps, response codes, and the full error responses
  • Screenshots of your quota settings page (both before and after the quota increase)
  • The code snippet you're using to make the generate_content requests.
  4. Alternative Approaches:
  • Smaller Batches: Experiment with breaking your data into smaller chunks and processing them sequentially with appropriate backoff. This may reduce the likelihood of hitting the rate limit.
  • Different Model: As a last resort, consider temporarily using a different Gemini model that also supports batch processing to move forward with your task while waiting for Google Cloud Support to resolve the issue.
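The backoff and error-logging suggestions in step 2 can be sketched as a small retry wrapper. This is a minimal sketch, not Google's reference implementation: `request_fn` is a hypothetical zero-argument callable wrapping one generate_content call, and the broad `except Exception` should be narrowed to your SDK's specific 429/ResourceExhausted exception type.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gemini-retry")

def call_with_backoff(request_fn, max_retries=6, base_delay=2.0):
    """Call request_fn, retrying on errors with exponential backoff.

    request_fn: zero-argument callable performing one generate_content
    request; it should raise on failure (adapt to your SDK's exceptions).
    """
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception as err:  # narrow to your SDK's 429 exception type
            # Log the FULL error, not just the message: the response body
            # may name the exact quota being exceeded.
            log.warning("attempt %d failed: %r", attempt, err)
            if attempt == max_retries:
                raise
            # Exponential backoff with a little jitter: 1x, 2x, 4x, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            log.info("backing off %.2fs before retry", delay)
            time.sleep(delay)
```

The jitter matters when several workers hit the quota at once, so retries don't re-collide on the same schedule.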

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Hey Ruthseki,

I appreciate your detailed response. Based on the release notes you provided, I am going to assume that fine-tuned Gemini models are not supported for batch processing. Furthermore, when going to Console > Vertex AI > Batch Prediction > Create > Version Select (on a gemini-1.5-flash fine-tuned model), I get the error "Model is not supported for batch prediction." This is unfortunate because batch processing would resolve the rate-limit issue entirely, but I will monitor the change-log. Thank you for the link!

Quota debugging

  • Deployed model (endpoint - Console > Vertex AI > Online Prediction): I am using the deployed model path "projects/X/locations/us-central1/endpoints/X". I CAN SEE in the monitoring for that endpoint ~5 req/sec, with 90% returning response 429.
  • Here is the raw 429 error
    • {'error': {'code': 429, 'message': 'Online prediction request quota exceeded for gemini-1.5-flash. Please try again later with backoff.', 'status': 'RESOURCE_EXHAUSTED'}}
  • After hitting the API for 72 HOURS, consistently hitting the rate limit, this is what I observe in Console > Quotas:
    • "Vertex AI" - "Online prediction requests per minute per region" - "region : us-central1"
      • 48 / 30,000 used - properly updating and showing graph usage well below quota
    • "Vertex AI" - "Generate content requests per minute per project per base model per minute per region per base_model" - "region: us-central1 & base_model = gemini-1.5-flash"
      • 0 / 100 used - NO UPDATES to graph or usage since quota was increased from 4 to 100
      • Screenshot 2024-12-11 at 6.09.20 PM.png

Have you seen this behavior of a quota suddenly not reporting anymore? I am definitely above the default 4 RPM limit per region per model, but definitely not at the full 100 RPM of my quota. I don't see any other quotas across my entire project being used besides the one mentioned ("Online prediction requests per minute per region").

Any advice or further debugging tips are appreciated!
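For anyone else hitting the default per-minute limit, a minimal client-side throttle can at least keep the request rate under a known RPM ceiling while debugging. This is a rough sketch (the `RpmThrottle` name and structure are illustrative, not from any SDK), and it complements, rather than replaces, exponential backoff on 429s:

```python
import collections
import time

class RpmThrottle:
    """Block until a request slot is free under a requests-per-minute cap.

    Keeps at most `rpm` request timestamps inside any rolling 60-second
    window. Sketch only: single-threaded, no persistence across restarts.
    """

    def __init__(self, rpm):
        self.rpm = rpm
        self.sent = collections.deque()  # monotonic timestamps of sent requests

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the 60-second window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) >= self.rpm:
            # Sleep until the oldest in-window request expires.
            wait_s = 60 - (now - self.sent[0])
            if wait_s > 0:
                time.sleep(wait_s)
            self.sent.popleft()
        self.sent.append(time.monotonic())
```

Calling `throttle.wait()` immediately before each generate_content request keeps the client under the cap, e.g. `RpmThrottle(4)` for the default 4 RPM limit.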

Hi @camelmnts,

Thank you for providing the image and the detailed breakdown of your testing and findings. 

You need to raise the "Model is not supported for batch prediction" issue with Google Cloud Support or you can file a defect report. Provide them with the details, including the release notes, your screenshots, and the precise steps to reproduce the error. 

When reporting the error, be sure to provide the screenshot so they see exactly what you are seeing.

You also need to report that "Generate content requests per minute per project per base model per minute per region per base_model" is not updating properly: it shows zero usage despite actual traffic. Make sure to mention your region and base_model, and that you already increased the quota from 4 to 100 but the problem persisted.

I hope the above information is helpful.

Published the defect report here: https://issuetracker.google.com/issues/383821626
Unfortunately, I am unable to create an Organization on my account, so I can't get support. I am also unable to transfer my fine-tuned model to my company Google account, where I could set up an organization and get support.

Any ideas for another way to get support from GCP?

Hi @camelmnts,

In your case, I suggest keeping an eye on the defect report you published for updates. I can see that it's already assigned to our Engineering Team. Note that there's no definite date for when this will be addressed.

I hope the above information is helpful.

I am having the same problem with the 429 errors. I do have an organization and support, but the solution they provided was to apply for more quota - yet the docs say 1 is the maximum, which we are already at. I am only sending text prompts; there should be no reason the system is overloaded. This happens in Studio and inside of Genkit, and it is making it impossible to use the technology for building an application. They are now saying the only time to meet to discuss this issue is on Christmas Eve? Google, please help!

Hi @camelmnts, did you ever resolve this issue? What was the resolution?

No fix; see the issue tracker: https://issuetracker.google.com/issues/383821626

From the documentation, it seems the gemini-1.5-pro-002 and gemini-1.5-flash-002 models use Dynamic Shared Quota (DSQ), meaning the standard quota increase doesn't apply. To guarantee higher throughput, Google requires Provisioned Throughput, which is in Preview and requires requesting access.