
Gemini 1.5 Flash Online Prediction Quota Exceeded - Urgent Help Needed!

We're encountering a significant roadblock with our Gemini 1.5 Flash models. We have five models working together, and we're constantly hitting the "Online prediction request quota exceeded for gemini-1.5-flash" error. This is severely impacting our project's progress.

We've checked our project's quotas and system limits in the Google Cloud console, and none of the relevant quotas appear to be anywhere near their maximum. We're struggling to find any documentation specifying what these online prediction quotas are or how to increase them.

Has anyone else encountered this issue? Does anyone know where we can find information about these quotas and how to request a limit increase? Any help or pointers would be greatly appreciated!


Hi @orkhestrai_1,

Welcome to Google Cloud Community!

It seems that the number of your requests exceeds the capacity allocated to process them. This capacity is shared among thousands of users. If you are receiving error code 429, you can try sending the request again later, once resources are freed. Also, as mentioned in this documentation, Gemini 1.5 Flash has dynamic quota, which means quota distributes on-demand capacity among all queries being processed by Google Cloud services.

As a workaround, I suggest reserving capacity by subscribing to Provisioned Throughput. For quota increases, you may check this documentation.

Hope this helps.

Same problem here. So annoying. I had to switch to another AI service provider to keep my services up and running.
The console shows no quota warnings, and I already requested a quota increase to 30,000.

Error:  429 Online prediction request quota exceeded for gemini-1.5-flash. Please try again later with backoff.

I am sure that my usage does not exceed the tokens-per-minute or requests-per-minute limits.

We received the same issue last Monday, with version 002, after a week of successful Gemini requests. Our solution was to go back to 001 in the meantime.
There is also the possibility of using LangChain to fall back to a backup region if one region is clogged.
If it is a production environment, the suggestion is to buy dedicated GSUs via Provisioned Throughput.

You will not receive a quota warning e-mail for this, as the limit is on the Google data center side.

Falling back to 001 is a good point; I hadn't thought of that approach. Anyway, I've stopped using Gemini for production until it becomes stable.

Hi everyone.

Thanks for all the answers and insights. We've gone the route of using backoff and several regions to mitigate the issue. It's working better now, although more slowly.
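For anyone curious, here is a minimal sketch of what a backoff-plus-region-rotation loop can look like. The `send_request` callable, the `QuotaExceededError` class, and the region names are placeholders for your actual Vertex AI client call and its 429 error type, not real library APIs:

```python
import random
import time

class QuotaExceededError(Exception):
    """Placeholder for the client error raised on an HTTP 429 response."""

def call_with_backoff(send_request, regions, max_retries=5, base_delay=1.0):
    """Call send_request(region), rotating through regions and sleeping
    with exponential backoff plus jitter after each quota error."""
    for attempt in range(max_retries):
        region = regions[attempt % len(regions)]  # try the next region on retry
        try:
            return send_request(region)
        except QuotaExceededError:
            # Exponential backoff: base, 2x, 4x, ... plus random jitter
            # so that concurrent workers don't all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
    raise RuntimeError("quota still exceeded after %d retries" % max_retries)
```

The jitter is the important part: without it, all your workers back off and retry in lockstep and hit the quota again together.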

Unfortunately we can't fall back to 001, because its results are much worse than 002's.

Cheers.

Would you mind sharing your backoff approach? Do you add a delay to each request?

That sounds like an interesting solution - I am also interested in the backoff approach.