I'm not able to find any Vertex AI quota that seems to have any usage on it, but I'm getting quota exceeded errors.
The error I'm getting is:
Statement failed: Vertex AI endpoint projects/alt-staging/locations/us-east4/publishers/google/models/text-embedding-004 quota has been exceeded. Please see Vertex AI error log, quota dashboard and https://cloud.google.com/vertex-ai/docs/quotas for details.
The code being executed is:
from google.cloud import spanner

query = """
SELECT embeddings.values
FROM ML.PREDICT(MODEL EmbeddingsModel,
    (SELECT CONCAT(COALESCE(@title, ''), ' ', COALESCE(@description, '')) AS content))
"""
params = {"title": title, "description": description}
param_types = {
    "title": spanner.param_types.STRING,
    "description": spanner.param_types.STRING,
}
try:
    with database.snapshot() as snapshot:
        result = snapshot.execute_sql(query, params=params, param_types=param_types)
        for row in result:
            return row[0]
except Exception as e:
    logger.error(f"Error fetching embeddings: {e}")
    return None
Can somebody help me understand what quota I need to adjust?
Thanks!
Hello TheBigMac! 😀
The error suggests you're hitting quota limits with Vertex AI's text-embedding-004 model. This issue is related to rate limits when using Vertex AI for generating text embeddings through BigQuery ML.
From some research, I've found your issue is most likely tied to one of these quotas:
Vertex AI online prediction requests per minute
Vertex AI model text-embedding-004 requests per minute
While in the GCP project that is hitting the limit, open the Quotas and system limits section of the GCP console and search for the quota "Online prediction requests per minute per region".
Another way to identify which quota or quotas need an increase is to follow this guide:
1. While in the GCP project that is hitting the quota limit, navigate to IAM & Admin > Quotas and system limits.
2. The specific quota to check is for Vertex AI API requests in the us-east4 region with the text-embedding-004 model.
3. Select "Dimensions" and filter for "location:us-east4".
4. Optional: add another filter for "metric" containing "TextEmbeddings" or "text-embedding".
Look for quota items related to online prediction requests per minute, API calls per minute for text-embedding-004, and requests per minute for embedding models. Be sure to scope your search to the us-east4 region.
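In the meantime, if this does turn out to be a per-minute rate limit, a simple exponential-backoff retry around the embedding call can absorb short bursts. Here's a minimal sketch; the helper name and the exception tuple are illustrative, not part of the Spanner client, so substitute whichever exception your client actually raises for the quota error:

```python
import time

def retry_with_backoff(fn, retryable=(Exception,), max_attempts=4, base_delay=1.0):
    """Call fn(); on a retryable exception, wait base_delay, 2x, 4x, ... and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the original error
            time.sleep(base_delay * (2 ** attempt))
```

You would wrap the snapshot/execute_sql call in a lambda and pass it in. This won't fix a quota that is set to 0, but it rules out transient bursts as the cause.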
Let me know how you get on. If you need help increasing the quotas once you've identified them, I can send you the info you'll need to make a formal request outside the GCP console.
Jai
So the big problem is identifying which quota I need to adjust. This was for Spanner, not BQ, not that I think that matters.
Here's what we have for the quota you suggested:
This quota doesn't seem to be specific to any model, just in general. I can't imagine we're making more than 600 requests per minute, as this is only being used in our DEV environment presently.
Also, I'm not seeing ANY quotas related to anything in Vertex AI that show ANY usage whatsoever. Is there a delay in seeing this? If so, how long?
For reference, here are some quotas we have in us-central1, where the ML.PREDICT works successfully:
And those same ones in us-east4:
Thanks for your help!
Mac
Hey Mac,
No worries, and apologies, I should have read the title more thoroughly. I did see Spanner in your query but wasn't sure whether it was causing the error.
Thanks for sharing the screenshots and additional info. Let's troubleshoot from the beginning.
The first thing that jumps out from the screenshots is that in us-central1 you have a quota of 360 for base model text-embedding-large-001, while the same quota in us-east4 is 0. Could the error message be generic, and that specific quota needs increasing for an element of your workflow? Having both projects the same will definitely help in ruling out possible quota issues.
Also, is this system running in the same project? Are the two systems identical, and if not, what's different between them?
I have limited production experience with Spanner; however, from your response and a little more digging: for ML.PREDICT operations specifically, each prediction consumes Spanner query resources. When using remote models (like Vertex AI text-embedding-004), predictions count against both Spanner quotas and Vertex AI quotas, and ML.PREDICT queries are more resource-intensive than standard queries.
To rule out Spanner as a cause of the quota error, try the following: within the Quotas and system limits tab, filter for "spanner.googleapis.com" in the same project and look for quotas related to queries per minute and instance compute units.
If you haven't already, look at the logs in Logs Explorer, narrowing the window to when the errors occur, and check for anything unusual. You could also compare logs from the working execution with the failing one for insights into the general flow of operations.
I would also dig deeper into the quota usage to see what's being consumed in both projects, and take a closer look at the logs to uncover which APIs and services are actually being called.
Quota usage should reflect in near real time (on average 5-15 minutes), but there can be delays depending on the API/service and how busy GCP is at the time.
Let me know how you get on.
The first thing that jumps out from the screenshots is that in us-central1 you have a quota of 360 for base model text-embedding-large-001, while the same quota in us-east4 is 0. Could the error message be generic, and that specific quota needs increasing for an element of your workflow? Having both projects the same will definitely help in ruling out possible quota issues.
>>> The model we're referencing is text-embedding-004, not the -large variant, though I'm honestly not sure whether the two are related. I don't see any quotas just for "text-embedding-004". There is only 1 project, but we're specifying the region for model usage: when we specify us-central1, our code works, and when we use us-east4 or us-east5, we get the error.
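For context, our model DDL pins the region inside the remote endpoint, so switching regions means recreating the model. Roughly what we use, as a sketch (the OUTPUT struct is simplified here, and the helper function is just for illustration, not part of any client library):

```python
# Build the Spanner CREATE MODEL DDL for a given region, so the remote
# endpoint points at a region where we actually have Vertex AI quota.
# NOTE: the OUTPUT column shape is simplified; check your real DDL.
DDL_TEMPLATE = """\
CREATE MODEL EmbeddingsModel
INPUT (content STRING(MAX))
OUTPUT (embeddings STRUCT<values ARRAY<FLOAT64>>)
REMOTE OPTIONS (
  endpoint = '//aiplatform.googleapis.com/projects/{project}/locations/{region}/publishers/google/models/text-embedding-004'
)"""

def embeddings_model_ddl(project, region):
    """Return the CREATE MODEL statement with the endpoint region filled in."""
    return DDL_TEMPLATE.format(project=project, region=region)
```

Applying it would then be a database.update_ddl([...]) call: with region set to us-central1 everything works, and with us-east4 we hit the quota error, which is why I suspect a per-region quota sitting at 0.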
I checked for spanner.googleapis.com and found nothing that seems to apply. Nothing that even mentions Query at all. I even used the CLI to dump all Spanner quotas to a file then searched/grep'd to find anything related to compute or query.
I've spent a bunch of time in the quota page, trying different searches and the only one that I found that was even close was:
But it's at 90 per minute. I think this is failing for us on our first call, indicating that the quota for whatever we need might be at 0?
I also went through Metrics Explorer and Logs Explorer, but when I try to filter by resource "aiplatform.googleapis.com" there are no options to choose from.
I have put in 2 quota increase requests for those two that were different between us-central1 and us-east4.
Thanks again!
Mac