I have been using the "gemini-1.5-flash-002" model with Vertex AI to generate content for the past few weeks. While it works well initially, it occasionally pauses unexpectedly after processing a certain number of requests.
I attempted to identify a pattern, such as the number of requests or the time elapsed before the pauses occur, but no consistent trend emerged. Sometimes, the model handles around 1,500 requests without issues, while other times, it pauses after approximately 100 requests.
The variation in the number of input tokens between requests is minimal, as the input data is relatively consistent in length.
When the pause occurs, it lasts for about 10 minutes before an error is eventually raised. Here is the code I am running:
```
import base64
import json
from datetime import datetime

import pytz
import vertexai
from google.oauth2 import service_account  # auth via a service account
from vertexai.generative_models import GenerationConfig, GenerativeModel

indian_tz = pytz.timezone("Asia/Kolkata")

# Service-account credentials are stored as a base64-encoded JSON string
cred_in_base64_encoding = "base64-encoded-google-app-credentials"
google_app_creds = json.loads(base64.b64decode(cred_in_base64_encoding).decode("utf-8"))
credentials = service_account.Credentials.from_service_account_info(
    google_app_creds, scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Initialize Vertex AI
vertexai.init(
    project=google_app_creds["project_id"],
    location="europe-west3",
    credentials=credentials,
)

model_name = "gemini-1.5-flash-002"
model = GenerativeModel(model_name)


def context_based_match(input_text):
    llm_query = f"LLM prompt with {input_text}"
    # The prompt asks for this JSON format: {"relevant": boolean}
    response = model.generate_content(
        llm_query,
        generation_config=GenerationConfig(
            response_mime_type="application/json",
            max_output_tokens=32,
            temperature=0,
            seed=1102,
            response_schema={
                "type": "object",
                "properties": {
                    "relevant": {
                        "type": "boolean",
                    }
                },
                "required": ["relevant"],
            },
        ),
    )
    try:
        return json.loads(response.text)
    except Exception as e:
        print(f"Error: {e}")
        return {"relevant": None}


data = list()  # list of input texts (populated in the real script)
for i, text in enumerate(data, 1):
    output = context_based_match(text)
    print(f"{i} | {datetime.now(indian_tz)} | {output['relevant']} | {text}")
```
I couldn't find anything about this issue in Google's documentation or anywhere else online. Even the "Quotas & System Limits" page doesn't show any usage statistics for the "gemini-1.5-flash-002" model; the only statistics I can see are for "Online prediction requests per minute per region", shown below.
| User | Count |
|---|---|
| | 2 |
| | 1 |
| | 1 |
| | 1 |
| | 1 |

Any insights would be appreciated.
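One more thing I am trying, in case this is a tokens-per-minute limit rather than a request-count limit: logging per-request token counts from the response's `usage_metadata` field. The helper below is only a sketch and assumes `context_based_match` is modified to also return the raw response object:

```
def log_token_usage(i, response):
    # usage_metadata carries the token accounting for a single call;
    # summing prompt_token_count per minute would show whether a
    # tokens-per-minute quota lines up with the pauses.
    usage = response.usage_metadata
    print(
        f"{i} | prompt={usage.prompt_token_count}"
        f" | output={usage.candidates_token_count}"
        f" | total={usage.total_token_count}"
    )
```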