Solved: Re: Vertex AI API quotas - in the real world

mixart · 07-13-2024 09:36 PM

I'm working on an app that uses the Vertex AI API. The default quota looks like it's 5 requests per minute, which I understand is a good starting point for development, but how big can the quotas get for a real-world app?

I imagine any conservatively successful app with, say, 1000 concurrent users might need something like a few thousand requests a minute. What if that app scales to 10K concurrent users? Would Vertex AI support this type of traffic?

ruthseki

Hi @mixart,

Welcome to Google Cloud Community!

Considering the growth of an app, it’s prudent to evaluate scaling the Vertex AI API usage. Here's a breakdown of how Vertex AI handles quotas and how to plan for scaling:

The default 5 requests per minute is designed to get you started and explore the API.

However, you can raise your quotas by contacting Google Cloud Support. For production apps, you'll need to discuss your expected traffic patterns and desired throughput with them. They can:

Analyze your specific use case.
Provide custom quota configurations.
Help you design your app to optimize performance and avoid potential bottlenecks.

Vertex AI can handle high throughput. With its scalable infrastructure, it is designed to handle massive amounts of data and requests.

It also supports autoscaling. You can configure deployed models to scale automatically based on demand, so your app can absorb sudden spikes in traffic without compromising performance.

With Vertex AI, you can deploy your models in multiple regions to distribute traffic and minimize latency.

Here are some key considerations for scaling:

Cost: Higher quotas will translate to increased costs, so consider cost-optimization strategies.

Model Optimization: Optimize your models for performance. This can involve:

Model compression (e.g., pruning)
Faster inference algorithms
Model caching

Traffic Management: Implement traffic management techniques like:

Load balancing
Queuing
Rate limiting
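As an illustration of the rate limiting item above, here is a minimal client-side token bucket in Python. It is only a sketch: the class and parameter names (TokenBucket, rate_per_minute, capacity, clock) are made up for this example and are not part of any Vertex AI SDK, and a real app would set the limit to match whatever quota it has actually been granted.

```python
import time


class TokenBucket:
    """Client-side limiter that admits roughly `rate_per_minute` calls per minute."""

    def __init__(self, rate_per_minute, capacity=None, clock=time.monotonic):
        # Tokens are added continuously at this per-second rate.
        self.rate_per_second = rate_per_minute / 60.0
        # Capacity bounds the burst size; it defaults to one minute of tokens.
        self.capacity = float(capacity if capacity is not None else rate_per_minute)
        self.tokens = self.capacity
        # The clock is injectable so the limiter can be tested without sleeping.
        self.clock = clock
        self.last_refill = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, up to capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now

    def try_acquire(self):
        """Consume one token and return True if available, otherwise return False."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When try_acquire() returns False, the caller could place the request on a queue or ask the user to retry shortly, which keeps the app under its quota instead of having requests rejected by the API.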

In addition, for 10K concurrent users, while it's impossible to give a precise quota without knowing your specific use case, here's some general guidance:

Request Frequency: 10K concurrent users might translate to thousands of requests per minute, but the exact number depends on how often each user triggers a request and how complex those requests are.
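To make the request-frequency point concrete, here is a back-of-the-envelope estimate in Python. The figure of 3 requests per user per minute is purely an assumption chosen to match the "few thousand requests a minute" guess in the question; substitute a number measured from your own app.

```python
def estimate_requests_per_minute(concurrent_users, requests_per_user_per_minute):
    """Rough peak load: every concurrent user issues requests at the same rate."""
    return concurrent_users * requests_per_user_per_minute


DEFAULT_QUOTA_RPM = 5  # default quota mentioned in this thread
ASSUMED_RATE = 3       # assumption: requests per user per minute

for users in (1_000, 10_000):
    rpm = estimate_requests_per_minute(users, ASSUMED_RATE)
    multiple = rpm / DEFAULT_QUOTA_RPM
    print(f"{users} users: about {rpm} requests/min ({multiple:.0f}x the default quota)")
```

An estimate like this is a useful number to bring to the quota conversation with Google Cloud Support, alongside any real traffic measurements you have.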

Vertex AI is Capable: Vertex AI is capable of handling this kind of traffic, but you'll likely need to work with Google Cloud Support to ensure appropriate quotas and optimization strategies.

Here are additional reminders when getting started with scaling:

Contact Support Early: Don't wait until you're dealing with production-scale traffic. Reach out to Google Cloud Support early in your development process.

Build for Scalability: Design your app with scalability in mind from the start.

Monitoring and Optimization: Implement robust monitoring to track performance and identify bottlenecks.

I hope the above information is helpful.
