Re: How Vertex AI rate limits are calculated on GCP?

diegol116 · 10-25-2024 12:19 PM

I'm planning to use Vertex AI on Google Cloud Platform for a few projects. While reading the rate limits section of the documentation, I came across this page:

https://cloud.google.com/vertex-ai/generative-ai/docs/quotas

However, I haven't found any information about the algorithm used to enforce these limits. I have two scenarios in mind:

First scenario: The limits apply to fixed clock windows. For example, 4 million tokens are available between 08:00:00 AM and 08:00:59 AM, and the allowance resets at 08:01:00 AM.
Second scenario: The limits apply to a moving window that advances as requests are made, for example the 60 seconds preceding each request.

Or maybe it's different from the scenarios outlined.

I would appreciate it if someone could explain how Google calculates this, or point me to the section of the documentation that covers it, since I haven't been able to find it.

dawnberdan

Hi @diegol116,

Welcome to Google Cloud Community!

Vertex AI generative AI quotas are expressed as requests per minute (RPM) and are set per base model: one limit covers the base model together with all of its versions, identifiers, and tuned versions. Each quota applies to requests from a given Google Cloud project in a given supported region. Unfortunately, Google doesn't publicly disclose the exact algorithm used to enforce these limits, so I can't confirm which of your two scenarios applies. Additionally, there are separate quotas for specific services such as RAG Engine and Gen AI Evaluation Service. Some quotas are shared across all applications and IP addresses within a Google Cloud project.
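To make the two scenarios from your question concrete, here is a minimal Python sketch of a fixed-window limiter and a sliding-window limiter. This is only an illustration and not Google's implementation, which isn't published. The class names, the 60-second window, and the tiny limit of 4 (standing in for 4 million tokens) are all assumptions chosen for the example.

```python
from collections import deque


class FixedWindowLimiter:
    """Scenario 1 (hypothetical): the budget resets at each window boundary."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None  # index of the window seen most recently
        self.used = 0

    def allow(self, now, cost=1):
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # A new clock window has started, so the budget is reset.
            self.current_window = window
            self.used = 0
        if self.used + cost > self.limit:
            return False
        self.used += cost
        return True


class SlidingWindowLimiter:
    """Scenario 2 (hypothetical): the budget covers the trailing window."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, cost) pairs, oldest first
        self.used = 0

    def allow(self, now, cost=1):
        # Forget usage that has fallen out of the trailing window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_cost = self.events.popleft()
            self.used -= old_cost
        if self.used + cost > self.limit:
            return False
        self.events.append((now, cost))
        self.used += cost
        return True


# Timestamps are seconds since 08:00:00; each request costs the full budget.
fixed = FixedWindowLimiter(limit=4)
sliding = SlidingWindowLimiter(limit=4)
for t in (59, 61):
    print(t, fixed.allow(t, cost=4), sliding.allow(t, cost=4))
```

In this sketch the fixed-window limiter accepts both requests, because the second one lands after the 08:01:00 reset, while the sliding-window limiter rejects the second one, because the first is still within the preceding 60 seconds. Sending a burst just before and just after a minute boundary is one way such behaviours could be told apart in principle, though I can't say how the real service would respond.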

I hope the above information is helpful.
