
How are Vertex AI rate limits calculated on GCP?

I'm planning to use Google Cloud Platform's Vertex AI for a few projects. While reading the rate limits section of the documentation, I came across this:

[Screenshot from the documentation page linked below]

https://cloud.google.com/vertex-ai/generative-ai/docs/quotas

But I haven't found any information anywhere about the algorithm that sets these limits. That is, I have two scenarios in my mind:

  • First scenario: The limits are at fixed times. For example, between 08:00:00 AM and 08:00:59 AM there are 4 million tokens available and at 08:01:00 AM the tokens are reset.
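  • Second scenario: The limits move as requests are made. For example, usage is counted over the most recent 60 seconds rather than over a clock-aligned minute.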

Or maybe it's different from the scenarios outlined.
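
To make the two scenarios concrete, here is a minimal Python sketch of each one. These are generic rate-limiting patterns (a fixed window and a sliding window), not something confirmed to be what Vertex AI does. The class names, the `allow` method, and the 4,000,000-token limit are hypothetical and used only for illustration.

```python
from collections import deque


class FixedWindowLimiter:
    """Scenario 1 (assumed): the budget resets at fixed clock boundaries."""

    def __init__(self, limit, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.window_id = None  # index of the clock-aligned window in use
        self.used = 0

    def allow(self, now, tokens):
        window_id = int(now // self.window_s)
        if window_id != self.window_id:
            # A new clock-aligned window has started: reset the counter.
            self.window_id = window_id
            self.used = 0
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


class SlidingWindowLimiter:
    """Scenario 2 (assumed): usage is summed over the trailing window."""

    def __init__(self, limit, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.events = deque()  # (timestamp, tokens), oldest first
        self.used = 0

    def allow(self, now, tokens):
        # Drop usage that is at least one full window old.
        while self.events and self.events[0][0] <= now - self.window_s:
            _, old_tokens = self.events.popleft()
            self.used -= old_tokens
        if self.used + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True


# Spend the whole budget at t=59s, then ask for more at t=61s.
fixed = FixedWindowLimiter(limit=4_000_000)
sliding = SlidingWindowLimiter(limit=4_000_000)
fixed.allow(59, 4_000_000)
sliding.allow(59, 4_000_000)
print(fixed.allow(61, 1_000_000))    # True: the counter reset at t=60s
print(sliding.allow(61, 1_000_000))  # False: the t=59s usage still counts
```

The practical difference shows up around the minute boundary: under the first scenario a burst just before and just after the reset can both succeed, while under the second the earlier burst keeps counting until it is a full 60 seconds old.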

I would appreciate it if someone could explain how Google calculates these limits, or point me to the section of the documentation that covers it, since I haven't been able to find it.


Hi @diegol116,

Welcome to Google Cloud Community!

Vertex AI Generative AI quotas are calculated based on the number of requests per minute (RPM) for a base model, and that count includes all of the model's versions, identifiers, and tuned versions. The quotas apply to requests for a given Google Cloud project and supported region, and some of them are shared across all applications and IP addresses within that project. There are also separate quotas for specific services such as RAG Engine and the Gen AI Evaluation Service. Unfortunately, Google doesn't publicly disclose the exact algorithm used to calculate these limits.

I hope the above information is helpful.