I'm encountering an error when using LLM-based (model-based) metrics such as "groundedness" or "coherence" with the Google Cloud Vertex AI rapid evaluation framework for the Gemini model. Standard computation-based metrics like ROUGE and BLEU work fine, but I specifically need the model-based metrics. I'm following this example notebook:
`prompt_engineering_evaluation_rapid_evaluation_sdk.ipynb`
## Environment
- Google Cloud Vertex AI
- `vertexai` SDK (not the older `aiplatform`)
- Python version: 3.10.12
- `google-cloud-aiplatform` version: 1.63.0
## Code snippet
```python
import pandas as pd
from tqdm import tqdm

import vertexai
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.generative_models import GenerativeModel
# ...

PROJECT_ID = "pa*****413"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

vertexai.init(project=PROJECT_ID, location=LOCATION)
instruction = "Summarize the following article"
context = [
    "To make a classic spaghetti carbonara, start by bringing a large pot of salted water to a boil. While the water is heating up, cook pancetta or guanciale in a skillet with olive oil over medium heat until it's crispy and golden brown. Once the pancetta is done, remove it from the skillet and set it aside. In the same skillet, whisk together eggs, grated Parmesan cheese, and black pepper to make the sauce. When the pasta is cooked al dente, drain it and immediately toss it in the skillet with the egg mixture, adding a splash of the pasta cooking water to create a creamy sauce.",
]
reference = [
    "The process of making spaghetti carbonara involves boiling pasta, crisping pancetta or guanciale, whisking together eggs and Parmesan cheese, and tossing everything together to create a creamy sauce.",
]

eval_dataset = pd.DataFrame(
    {
        "context": context,
        "instruction": [instruction] * len(context),
        "reference": reference,
        "prompt": [instruction] * len(context),
    }
)

prompt_templates = [
    "Instruction: {instruction}. Article: {context}. Summary:",
    "Article: {context}. Complete this task: {instruction}, in one sentence. Summary:",
    "Goal: {instruction} and give me a TLDR. Here's an article: {context}. Summary:",
]

metrics = [
    "rouge_1",
    "bleu",
    "coherence",
    "groundedness",
]
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment="test-eval",
)
gemini_model = GenerativeModel("gemini-pro")
experiment_name = "eval-sdk-prompt-engineering"  # @param {type:"string"}

summarization_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment=experiment_name,
)
import uuid

run_id = uuid.uuid4().hex[:8]  # stands in for the notebook's generate_uuid() helper
eval_results = []
for i, prompt_template in tqdm(
    enumerate(prompt_templates), total=len(prompt_templates)
):
    eval_result = summarization_eval_task.evaluate(
        prompt_template=prompt_template,
        model=gemini_model,
    )
    eval_results.append(
        (f"Prompt #{i}", eval_result.summary_metrics, eval_result.metrics_table)
    )
```
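For anyone reproducing this locally: the substitution that `evaluate()` performs via `prompt_template` can be mirrored with plain `str.format` (this is only an illustration of the mechanics, not the SDK's internal code), which also makes a handy sanity check that every placeholder has a matching dataset column:

```python
import string

# Columns present in eval_dataset above.
columns = {"context", "instruction", "reference", "prompt"}

prompt_templates = [
    "Instruction: {instruction}. Article: {context}. Summary:",
    "Article: {context}. Complete this task: {instruction}, in one sentence. Summary:",
    "Goal: {instruction} and give me a TLDR. Here's an article: {context}. Summary:",
]

# Every placeholder in every template must be a dataset column,
# otherwise evaluate() has nothing to substitute.
for template in prompt_templates:
    fields = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    assert fields <= columns, f"template references missing columns: {fields - columns}"

# Mirror of the substitution itself, for the first template and a sample row.
row = {
    "instruction": "Summarize the following article",
    "context": "To make a classic spaghetti carbonara, ...",
}
filled = prompt_templates[0].format(**row)
print(filled)
```

This passes for all three templates with my dataset, so the template/column wiring doesn't seem to be the problem.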
## Error message
```
Error: 400 Failed to make prediction request. If you're using a new project, expect a delay and retry in a few minutes. Error: /PredictionServiceV1.GenerateContent to [2002:a05:6681:452a::]:4339 : APP_ERROR(3) User has requested a restricted HarmBlockThreshold setting BLOCK_NONE. You can get access either (a) through an allowlist via your Google account team, or (b) by switching your account type to monthly invoiced billing via this instruction: https://cloud.google.com/billing/docs/how-to/invoiced-billing.
```
## What I've tried
1. Changing metrics to standard NLP metrics like ROUGE and BLEU (works, but not what I need)
2. Explicitly setting and removing safety settings for the GenerativeModel
```python
from vertexai.preview.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
)

safety_settings = {
    HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}

model = GenerativeModel("gemini-pro", safety_settings=safety_settings)
```
3. Updating dependencies and checking API quotas
4. Verifying project configuration and permissions (tried with both personal owner account and company account)
5. Using different LLM-specific metrics (e.g., changing from "groundedness" to "coherence")
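One isolation step I'm considering (a sketch, assuming the rapid evaluation SDK's bring-your-own-response mode, where a `response` column lets model-based metrics score pre-generated text without calling the candidate model): if this still fails, the BLOCK_NONE error would have to come from the judge model rather than from my `gemini-pro` call.

```python
import pandas as pd

# Pre-generated response supplied directly; no candidate generation is needed,
# so any failure should come from the judge-model side of the eval service.
byor_dataset = pd.DataFrame(
    {
        "prompt": ["Summarize the following article: ..."],
        "response": [
            "Spaghetti carbonara is made by boiling pasta, crisping pancetta, "
            "and tossing both with an egg-and-Parmesan sauce."
        ],
        "reference": [
            "The process of making spaghetti carbonara involves boiling pasta, "
            "crisping pancetta or guanciale, whisking together eggs and Parmesan "
            "cheese, and tossing everything together to create a creamy sauce."
        ],
    }
)

# Requires GCP credentials and vertexai.init(); commented out in this sketch.
# from vertexai.preview.evaluation import EvalTask
# result = EvalTask(dataset=byor_dataset, metrics=["coherence"]).evaluate()

print(byor_dataset.columns.tolist())
```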
## Questions
1. Has anyone successfully used LLM-specific metrics like "groundedness" or "coherence" with the Vertex AI evaluation framework for the Gemini model?
2. Are there specific permissions, project settings, or API enablements required for these advanced metrics?
3. Is there a beta program or allowlist for using these LLM-specific metrics with the Gemini model?
4. Are there any known issues or limitations with these metrics in the current version of the Vertex AI SDK?
Any insights or solutions would be greatly appreciated!