Hi GCP Community,
I’m trying to perform text embedding using the task type RETRIEVAL_DOCUMENT in batch prediction. However, no matter which approach I use, the embeddings I obtain are the same as those generated by the default RETRIEVAL_QUERY task type. This seems consistent with the documentation, which states that the default embedding task type is RETRIEVAL_QUERY.
I’d like to know if there’s a way to configure my batch embeddings to specifically use task_type="RETRIEVAL_DOCUMENT". Below are the two methods I’ve tried:
Attempt 1:
from vertexai.preview.language_models import TextEmbeddingModel
# Each line of input.jsonl looks like:
# {"content": "sample text", "task_type": "RETRIEVAL_DOCUMENT"}

text_embedding_model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
batch_prediction_job = text_embedding_model.batch_predict(
    dataset="gs://bucket-name/input.jsonl",
    destination_uri_prefix="gs://bucket-name/output/",
)
Attempt 2:
from vertexai.preview.language_models import TextEmbeddingModel
# Each line of input.jsonl looks like:
# {"content": "sample text"}

text_embedding_model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
batch_prediction_job = text_embedding_model.batch_predict(
    dataset="gs://bucket-name/input.jsonl",
    destination_uri_prefix="gs://bucket-name/output/",
    model_parameters={"task_type": "RETRIEVAL_DOCUMENT"},
)
For comparison, here’s how I perform streaming text embedding, which successfully uses task_type="RETRIEVAL_QUERY":
Streaming Code:
from typing import List

from vertexai.preview.language_models import TextEmbeddingInput, TextEmbeddingModel

def embed_text(
    texts: List[str] = ["banana muffins?", "banana bread? banana muffins?"],
    task: str = "RETRIEVAL_QUERY",
    model_name: str = "text-multilingual-embedding-002",
) -> List[List[float]]:
    """Embeds texts with a pre-trained, foundational model."""
    model = TextEmbeddingModel.from_pretrained(model_name)
    inputs = [TextEmbeddingInput(text, task) for text in texts]
    embeddings = model.get_embeddings(inputs)
    return [embedding.values for embedding in embeddings]
I’d greatly appreciate any guidance on:
Whether it’s possible to explicitly specify task_type="RETRIEVAL_DOCUMENT" for batch prediction.
If so, what changes should I make to my code or configuration?
Thank you for your help!
Reference documents:
https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api
https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/batch-prediction-genai-embeddings
Hi @tzuting,
Welcome to Google Cloud Community!
The task_type parameter guides the Text Embedding model on how to understand and handle your input text, determining the type of embedding produced. Different task_type values generate different embedding vectors, each optimized for specific downstream tasks.
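To see why the distinction matters in retrieval: query and document embeddings are typically compared by cosine similarity, so vectors produced under different task types are tuned for that comparison and are not interchangeable. Below is a minimal, self-contained sketch of the comparison step; the vectors are hypothetical stand-ins, and in practice they would be the .values returned by model.get_embeddings().

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-in vectors; real ones would come from a
# RETRIEVAL_QUERY embedding and a RETRIEVAL_DOCUMENT embedding.
query_vec = [0.1, 0.3, 0.5]
doc_vec = [0.2, 0.1, 0.4]
print(cosine_similarity(query_vec, doc_vec))
```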
How to Use task_type:
1. Streaming Mode: You can specify task_type for each individual text embedding using the TextEmbeddingInput class in streaming mode.
2. Batch Prediction: As of now, you cannot directly specify task_type for batch predictions. The default RETRIEVAL_QUERY task type is applied to all texts in the batch.
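For reference, the batch input JSONL can be prepared programmatically with only the standard library. This sketch mirrors the per-record format shown in the question, including the task_type field, though as noted above batch prediction currently ignores it and applies RETRIEVAL_QUERY; the example texts are placeholders.

```python
import json

# Placeholder texts; real content would come from your corpus.
texts = ["sample text one", "sample text two"]

# One JSON object per line, mirroring the format in the question.
# Note: batch prediction currently ignores task_type and defaults
# to RETRIEVAL_QUERY.
lines = [
    json.dumps({"content": t, "task_type": "RETRIEVAL_DOCUMENT"})
    for t in texts
]

with open("input.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```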
In addition, you can also check this GitHub repository as a baseline for your troubleshooting.
I hope the above information is helpful.