Quota exceeded for quota metric 'LLM utility reque...

vivekbhat · 05-21-2024 09:33 PM

Hi there, I am new to Google Cloud. I am trying to access VertexAiEmbeddingModel with the model textembedding-gecko, but I am getting the following error:

com.google.api.gax.rpc.ResourceExhaustedException: io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Quota exceeded for quota metric 'LLM utility requests' and limit 'LLM utility requests per minute per region' of service 'aiplatform.googleapis.com' for consumer 'project_number:732984556506'.] with root cause
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Quota exceeded for quota metric 'LLM utility requests' and limit 'LLM utility requests per minute per region' of service 'aiplatform.googleapis.com' for consumer 'project_number:732984556506'.

Any help will be appreciated

kolzeq

Same issue here, and I have not found how to increase the quota value.

Yravikrishna

You get this error when you send too many requests in a short period of time. If you are using a free account, limit the amount of data you send. Alternatively, send the data in batches, with a sleep call between batches in a loop.
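
A minimal sketch of the batching-with-sleep idea above, in plain Java. The class name BatchedSender, the batch size, the pause length, and the sender callback are all illustrative assumptions; the callback stands in for whatever embedding call you actually make.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedSender {

    // Splits items into consecutive batches of at most batchSize elements.
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Passes each batch to the sender callback and sleeps between batches.
    // Returns the number of batches that were sent.
    public static <T> int sendInBatches(List<T> items, int batchSize, long pauseMillis,
                                        Consumer<List<T>> sender) throws InterruptedException {
        List<List<T>> batches = partition(items, batchSize);
        for (int i = 0; i < batches.size(); i++) {
            sender.accept(batches.get(i));
            if (i < batches.size() - 1) {
                Thread.sleep(pauseMillis); // no pause needed after the last batch
            }
        }
        return batches.size();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> texts = List.of("a", "b", "c", "d", "e");
        // Hypothetical sender: replace the lambda body with the real embedding call.
        int sent = sendInBatches(texts, 2, 100, batch -> System.out.println("Sending " + batch));
        System.out.println("Batches sent: " + sent);
    }
}
```

The right batch size and pause depend on the per-minute limit shown for your project on the Quotas page, so treat the values 2 and 100 ms here as placeholders.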

Poala_Tenorio

The error message you are seeing, Dataset validation failed: {consecutive_turns: [73]}, indicates that there is an issue with consecutive messages or turns in your dataset at line 73. This error is often related to formatting issues in the JSONL file.

The error you're encountering indicates that your usage of the Vertex AI embedding model has exceeded the allocated quota for requests per minute for your Google Cloud project. Here are some steps you can take to address this issue:

Check Quota Limits:

Go to the Google Cloud Console.
Navigate to the "IAM & Admin" section and select "Quotas".
Filter for "aiplatform.googleapis.com" and check the quotas related to "LLM utility requests".

Request Quota Increase:

If you frequently exceed the quota, you can request an increase.
In the Quotas page, select the quota you are exceeding and click on "EDIT QUOTAS".
Follow the steps to submit a request for a higher quota. Note that this process may take some time and is subject to approval by Google Cloud.

Optimize Requests:

Ensure that your application is efficiently using the Vertex AI service. This might include batching requests where possible to reduce the number of API calls.
Implement retry logic with exponential backoff to handle transient quota exceedances gracefully.

Distribute Requests:

If your application can run in multiple regions, you might consider distributing the requests across different regions to balance the load and reduce the chance of hitting quota limits in a single region.

Monitor Usage:

Regularly monitor your usage to identify patterns that lead to quota exceedances.
Set up alerts in the Google Cloud Console to notify you when your usage approaches quota limits.
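
Since the limit in the error is counted per region, the "Distribute Requests" step can be sketched as a simple round-robin selector. This is only an illustration: the class name RegionRotator is hypothetical, the region names are examples, and you would need to confirm that the model is actually offered in each region you list before building a client for it.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RegionRotator {
    private final List<String> regions;
    private final AtomicInteger next = new AtomicInteger(0);

    public RegionRotator(List<String> regions) {
        if (regions == null || regions.isEmpty()) {
            throw new IllegalArgumentException("at least one region is required");
        }
        this.regions = List.copyOf(regions);
    }

    // Returns the regions in order, wrapping around at the end of the list.
    // floorMod keeps the index valid even if the counter overflows.
    public String nextRegion() {
        int index = Math.floorMod(next.getAndIncrement(), regions.size());
        return regions.get(index);
    }

    public static void main(String[] args) {
        // Example region names only; check model availability per region.
        RegionRotator rotator = new RegionRotator(List.of("us-central1", "europe-west4"));
        for (int i = 0; i < 4; i++) {
            System.out.println("Request " + i + " goes to " + rotator.nextRegion());
        }
    }
}
```

Each call to nextRegion() would pick the location used to configure the client for the next request, so consecutive requests count against different regional quotas.
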

Here is an example of how to implement exponential backoff in Java using the Retryer class from the guava-retrying library (package com.github.rholder.retry):

import com.github.rholder.retry.*;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class VertexAiClient {
    private static final int MAX_ATTEMPTS = 5;
    private static final long INITIAL_INTERVAL = 1000; // multiplier for the wait, 1 second
    private static final long MAX_INTERVAL = 5000; // upper bound on any single wait, 5 seconds

    public static void main(String[] args) {
        // Retry only on quota errors, wait exponentially longer between attempts
        // (capped at MAX_INTERVAL), and give up after MAX_ATTEMPTS attempts.
        Retryer<Boolean> retryer = RetryerBuilder.<Boolean>newBuilder()
                .retryIfExceptionOfType(ResourceExhaustedException.class)
                .withWaitStrategy(WaitStrategies.exponentialWait(
                        INITIAL_INTERVAL, MAX_INTERVAL, TimeUnit.MILLISECONDS))
                .withStopStrategy(StopStrategies.stopAfterAttempt(MAX_ATTEMPTS))
                .build();

        try {
            retryer.call(new Callable<Boolean>() {
                @Override
                public Boolean call() throws Exception {
                    // Call your Vertex AI API here
                    return callVertexAiApi();
                }
            });
        } catch (RetryException e) {
            // All attempts were used up without success
            e.printStackTrace();
        } catch (ExecutionException e) {
            // The call failed with an exception that is not retried
            e.printStackTrace();
        }
    }

    private static Boolean callVertexAiApi() throws ResourceExhaustedException {
        // Simulates an API call that always fails with RESOURCE_EXHAUSTED.
        // Replace this with your actual API call.
        throw new ResourceExhaustedException("Quota exceeded");
    }

    // Stand-in so the sample compiles on its own. With the real client, remove this
    // class and import com.google.api.gax.rpc.ResourceExhaustedException instead.
    static class ResourceExhaustedException extends Exception {
        public ResourceExhaustedException(String message) {
            super(message);
        }
    }
}

By implementing exponential backoff, you can manage retries in a controlled manner and reduce the chance of quickly hitting quota limits again.
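
To make the backoff schedule itself concrete, here is a dependency-free sketch in plain Java. It is not how any particular library computes its waits; the class name SimpleBackoff, the doubling rule, and the 1 second / 5 second values are assumptions chosen for illustration. It also retries on every exception, whereas a real client would normally retry only on the quota error.

```java
import java.util.concurrent.Callable;

public class SimpleBackoff {

    // Wait before retry number `attempt` (0-based): initial * 2^attempt, capped at max.
    public static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must not be negative");
        }
        // Limit the shift so the factor stays well inside the range of long.
        long factor = 1L << Math.min(attempt, 30);
        return Math.min(initialMillis * factor, maxMillis);
    }

    // Runs the task up to maxAttempts times in total, sleeping between failed attempts.
    // Rethrows the last exception if every attempt fails.
    public static <T> T callWithBackoff(Callable<T> task, int maxAttempts,
                                        long initialMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts - 1) {
                    Thread.sleep(delayMillis(attempt, initialMillis, maxMillis));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Prints 1000, 2000, 4000, 5000, 5000 (one value per line).
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println(delayMillis(attempt, 1000, 5000));
        }
    }
}
```

With these values the waits double until they reach the 5 second cap, which spreads retries out instead of immediately repeating the request that hit the limit.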

Following these steps should help you mitigate and manage the quota exceeded errors you are encountering with Vertex AI on Google Cloud.