Constant 429 Errors with gemini-2.0-flash-thinking-exp-01-21 API on Vertex AI

Hi everyone,

I'm encountering consistent 429 errors (rate limit exceeded) when using the Vertex AI gemini-2.0-flash-thinking-exp-01-21 API. Here’s a brief summary of my situation:

  • Yesterday, I used Vertex AI to process data with the gemini-2.0-flash-thinking-exp-01-21 API.

  • After processing around 5000 data items, I started receiving 429 errors.

  • When I interrupted the task and restarted it, all subsequent requests returned a 429 error.

  • My code employs a ThreadPoolExecutor from Python’s concurrent.futures with max_workers set to 32. Even reducing max_workers to 16 doesn't alleviate the issue.

Has anyone experienced similar issues, or can anyone offer insight into why this might be happening?

Thanks in advance for your help!

Hi @laolaorkkkkk,

Welcome to the Google Cloud Community!

It looks like you are encountering 429 Too Many Requests errors due to hitting the rate limits of the Vertex AI gemini-2.0-flash-thinking-exp-01-21 API. Despite reducing your concurrency, the issue persists, suggesting the rate limit is likely applied at your project or user level rather than just per connection.

Here are some approaches that might help with your use case:

  • Identify the Specific Rate Limit: Check the Quotas page in the Google Cloud Console by navigating to IAM & Admin -> Quotas in your project. Look for quotas related to Vertex AI, particularly the Gemini API, to view your usage limits (e.g., requests per minute/day) and your current consumption.
  • Implement Retry Logic with Exponential Backoff: Retry logic with exponential backoff is essential for managing rate limits effectively. When you encounter a 429 response, avoid retrying immediately. Instead, wait for a short period before retrying, and if you receive another 429, progressively increase the waiting time before each subsequent attempt.
  • Batch Requests: If the API supports batching for this model, you can consolidate multiple data items into a single request. This reduces the number of API calls and can improve efficiency.
  • Request a Quota Increase: If you consistently require higher throughput, you can request a quota increase to improve the efficiency, reliability, and scalability of your data processing with the Gemini API.
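
As a rough illustration of the exponential backoff suggestion above, here is a minimal Python sketch. It is not taken from the Vertex AI SDK: `RateLimitError` and `call_with_backoff` are hypothetical names, and the delay values are arbitrary assumptions. In practice you would catch whatever exception your client raises on a 429 (with the Google Python clients this is typically `google.api_core.exceptions.ResourceExhausted`, but please verify for your setup). Random jitter is added so that many worker threads do not all retry at the same moment.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the exception your client raises on HTTP 429."""


def call_with_backoff(fn, *, max_attempts=6, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter.

    The delay values are illustrative assumptions, not recommended settings.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # The upper bound doubles each time (1s, 2s, 4s, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random amount below the bound so threads spread out.
            sleep(random.uniform(0, cap))
```

Each worker in the thread pool would wrap its request in something like `call_with_backoff(lambda: send_request(item))`, where `send_request` is a placeholder for your own request function.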

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Hi, it seems that the default quota value of 10 isn't enough. I also emailed Google Cloud Platform Support, but they told me that the quota for the experimental API cannot be increased.
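
If the quota really cannot be raised, one possible workaround is to throttle requests on the client side so that the thread pool never exceeds the limit in the first place. The sketch below assumes the quota of 10 means 10 requests per minute, which the thread does not confirm, so check the unit on the Quotas page. `IntervalLimiter`, `run_all`, and `call_model` are hypothetical names for illustration, not part of any Google library.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class IntervalLimiter:
    """Space out call starts evenly across all threads sharing this object."""

    def __init__(self, calls_per_minute, clock=time.monotonic, sleep=time.sleep):
        self._interval = 60.0 / calls_per_minute
        self._lock = threading.Lock()
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside
        # the lock so other threads can reserve their own slots meanwhile.
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


def run_all(items, call_model, limiter, max_workers=4):
    """Process items in a thread pool, pacing requests through the limiter."""
    def task(item):
        limiter.wait()
        return call_model(item)  # call_model is your own request function

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))
```

With `IntervalLimiter(calls_per_minute=10)`, requests start roughly 6 seconds apart regardless of `max_workers`, so a large pool such as 32 workers brings little benefit under this quota.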