I'm encountering an error when using LLM-based (model-based) metrics such as "groundedness" or "coherence" with the Google Cloud Vertex AI rapid evaluation framework for the Gemini model. Standard computation-based metrics like ROUGE and BLEU work fine, but I specifically need the model-based metrics. I'm following this example notebook:
`prompt_engineering_evaluation_rapid_evaluation_sdk.ipynb`
## Environment
- Google Cloud Vertex AI
- `vertexai` SDK (not the older `aiplatform`)
- Python version: 3.10.12
- `google-cloud-aiplatform` version: 1.63.0
## Code snippet
```python
import pandas as pd
from tqdm import tqdm

import vertexai
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.generative_models import GenerativeModel
# ...

PROJECT_ID = "pa*****413"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

vertexai.init(project=PROJECT_ID, location=LOCATION)
instruction = "Summarize the following article"
context = [
    "To make a classic spaghetti carbonara, start by bringing a large pot of salted water to a boil. While the water is heating up, cook pancetta or guanciale in a skillet with olive oil over medium heat until it's crispy and golden brown. Once the pancetta is done, remove it from the skillet and set it aside. In the same skillet, whisk together eggs, grated Parmesan cheese, and black pepper to make the sauce. When the pasta is cooked al dente, drain it and immediately toss it in the skillet with the egg mixture, adding a splash of the pasta cooking water to create a creamy sauce.",
]
reference = [
    "The process of making spaghetti carbonara involves boiling pasta, crisping pancetta or guanciale, whisking together eggs and Parmesan cheese, and tossing everything together to create a creamy sauce.",
]

eval_dataset = pd.DataFrame(
    {
        "context": context,
        "instruction": [instruction] * len(context),
        "reference": reference,
        "prompt": [instruction] * len(context),
    }
)

prompt_templates = [
    "Instruction: {instruction}. Article: {context}. Summary:",
    "Article: {context}. Complete this task: {instruction}, in one sentence. Summary:",
    "Goal: {instruction} and give me a TLDR. Here's an article: {context}. Summary:",
]

metrics = [
    "rouge_1",
    "bleu",
    "coherence",
    "groundedness",
]
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment="test-eval",
)
gemini_model = GenerativeModel("gemini-pro")
experiment_name = "eval-sdk-prompt-engineering"  # @param {type:"string"}

summarization_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment=experiment_name,
)
import uuid

run_id = uuid.uuid4().hex[:8]  # stands in for the notebook's generate_uuid() helper
eval_results = []
for i, prompt_template in tqdm(
    enumerate(prompt_templates), total=len(prompt_templates)
):
    eval_result = summarization_eval_task.evaluate(
        prompt_template=prompt_template,
        model=gemini_model,
    )
    eval_results.append(
        (f"Prompt #{i}", eval_result.summary_metrics, eval_result.metrics_table)
    )
```
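For anyone reproducing this locally: the substitution that `evaluate()` performs via `prompt_template` can be mirrored with plain `str.format` (this is only an illustration of the mechanics, not the SDK's internal code), which also makes a handy sanity check that every placeholder has a matching dataset column:

```python
import string

# Columns present in eval_dataset above.
columns = {"context", "instruction", "reference", "prompt"}

prompt_templates = [
    "Instruction: {instruction}. Article: {context}. Summary:",
    "Article: {context}. Complete this task: {instruction}, in one sentence. Summary:",
    "Goal: {instruction} and give me a TLDR. Here's an article: {context}. Summary:",
]

# Every placeholder in every template must be a dataset column,
# otherwise evaluate() has nothing to substitute.
for template in prompt_templates:
    fields = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    assert fields <= columns, f"template references missing columns: {fields - columns}"

# Mirror of the substitution itself, for the first template and a sample row.
row = {
    "instruction": "Summarize the following article",
    "context": "To make a classic spaghetti carbonara, ...",
}
filled = prompt_templates[0].format(**row)
print(filled)
```

This passes for all three templates with my dataset, so the template/column wiring doesn't seem to be the problem.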
## Error message
```
Error: 400 Failed to make prediction request. If you're using a new project, expect a delay and retry in a few minutes. Error: /PredictionServiceV1.GenerateContent to [2002:a05:6681:452a::]:4339 : APP_ERROR(3) User has requested a restricted HarmBlockThreshold setting BLOCK_NONE. You can get access either (a) through an allowlist via your Google account team, or (b) by switching your account type to monthly invoiced billing via this instruction: https://cloud.google.com/billing/docs/how-to/invoiced-billing.
```
## What I've tried
1. Changing metrics to standard NLP metrics like ROUGE and BLEU (works, but not what I need)
2. Explicitly setting and removing safety settings for the GenerativeModel
```python
from vertexai.preview.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
)

safety_settings = {
    HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}

model = GenerativeModel("gemini-pro", safety_settings=safety_settings)
```
3. Updating dependencies and checking API quotas
4. Verifying project configuration and permissions (tried with both personal owner account and company account)
5. Using different LLM-specific metrics (e.g., changing from "groundedness" to "coherence")
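One isolation step I'm considering (a sketch, assuming the rapid evaluation SDK's bring-your-own-response mode, where a `response` column lets model-based metrics score pre-generated text without calling the candidate model): if this still fails, the BLOCK_NONE error would have to come from the judge model rather than from my `gemini-pro` call.

```python
import pandas as pd

# Pre-generated response supplied directly; no candidate generation is needed,
# so any failure should come from the judge-model side of the eval service.
byor_dataset = pd.DataFrame(
    {
        "prompt": ["Summarize the following article: ..."],
        "response": [
            "Spaghetti carbonara is made by boiling pasta, crisping pancetta, "
            "and tossing both with an egg-and-Parmesan sauce."
        ],
        "reference": [
            "The process of making spaghetti carbonara involves boiling pasta, "
            "crisping pancetta or guanciale, whisking together eggs and Parmesan "
            "cheese, and tossing everything together to create a creamy sauce."
        ],
    }
)

# Requires GCP credentials and vertexai.init(); commented out in this sketch.
# from vertexai.preview.evaluation import EvalTask
# result = EvalTask(dataset=byor_dataset, metrics=["coherence"]).evaluate()

print(byor_dataset.columns.tolist())
```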
## Questions
1. Has anyone successfully used LLM-specific metrics like "groundedness" or "coherence" with the Vertex AI evaluation framework for the Gemini model?
2. Are there specific permissions, project settings, or API enablements required for these advanced metrics?
3. Is there a beta program or allowlist for using these LLM-specific metrics with the Gemini model?
4. Are there any known issues or limitations with these metrics in the current version of the Vertex AI SDK?
Any insights or solutions would be greatly appreciated!