Hi GCP Community,
I’m trying to perform text embedding using the task type RETRIEVAL_DOCUMENT in batch prediction. However, no matter which approach I use, the embeddings I obtain are the same as those generated by the default RETRIEVAL_QUERY task type. This seems consistent with the documentation, which states that the default embedding task type is RETRIEVAL_QUERY.
I’d like to know if there’s a way to configure my batch embeddings to specifically use task_type="RETRIEVAL_DOCUMENT". Below are the two methods I’ve tried:
Attempt 1:
from vertexai.preview.language_models import TextEmbeddingModel
# Each line of input.jsonl looks like:
# {"content": "sample text", "task_type": "RETRIEVAL_DOCUMENT"}

text_embedding_model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
batch_prediction_job = text_embedding_model.batch_predict(
    dataset="gs://bucket-name/input.jsonl",
    destination_uri_prefix="gs://bucket-name/output/",
)
Attempt 2:
from vertexai.preview.language_models import TextEmbeddingModel
# Each line of input.jsonl looks like:
# {"content": "sample text"}

text_embedding_model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
batch_prediction_job = text_embedding_model.batch_predict(
    dataset="gs://bucket-name/input.jsonl",
    destination_uri_prefix="gs://bucket-name/output/",
    model_parameters={"task_type": "RETRIEVAL_DOCUMENT"},
)
For comparison, here’s how I perform streaming text embedding, which successfully uses task_type="RETRIEVAL_QUERY":
Streaming Code:
from typing import List

from vertexai.preview.language_models import TextEmbeddingInput, TextEmbeddingModel

def embed_text(
    texts: List[str] = ["banana muffins?", "banana bread? banana muffins?"],
    task: str = "RETRIEVAL_QUERY",
    model_name: str = "text-multilingual-embedding-002",
) -> List[List[float]]:
    """Embeds texts with a pre-trained, foundational model."""
    model = TextEmbeddingModel.from_pretrained(model_name)
    inputs = [TextEmbeddingInput(text, task) for text in texts]
    embeddings = model.get_embeddings(inputs)
    return [embedding.values for embedding in embeddings]
I’d greatly appreciate any guidance on:
Whether it’s possible to explicitly specify task_type="RETRIEVAL_DOCUMENT" for batch prediction.
If so, what changes should I make to my code or configuration?
Thank you for your help!
Reference documents:
https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api
https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/batch-prediction-genai-embeddings
Hi @tzuting,
Welcome to Google Cloud Community!
The task_type parameter guides the Text Embedding model on how to understand and handle your input text, determining the type of embedding produced. Different task_type values generate different embedding vectors, each optimized for specific downstream tasks.
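To see why the distinction matters in retrieval: query and document embeddings are typically compared by cosine similarity, so vectors produced under different task types are tuned for that comparison and are not interchangeable. Below is a minimal, self-contained sketch of the comparison step; the vectors are hypothetical stand-ins, and in practice they would be the .values returned by model.get_embeddings().

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-in vectors; real ones would come from a
# RETRIEVAL_QUERY embedding and a RETRIEVAL_DOCUMENT embedding.
query_vec = [0.1, 0.3, 0.5]
doc_vec = [0.2, 0.1, 0.4]
print(cosine_similarity(query_vec, doc_vec))
```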
How to Use task_type:
1. Streaming Mode: You can specify task_type for each individual text embedding using the TextEmbeddingInput class in streaming mode.
2. Batch Prediction: As of now, you cannot directly specify task_type for batch predictions. The default RETRIEVAL_QUERY task type is applied to all texts in the batch.
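For reference, the batch input JSONL can be prepared programmatically with only the standard library. This sketch mirrors the per-record format shown in the question, including the task_type field, though as noted above batch prediction currently ignores it and applies RETRIEVAL_QUERY; the example texts are placeholders.

```python
import json

# Placeholder texts; real content would come from your corpus.
texts = ["sample text one", "sample text two"]

# One JSON object per line, mirroring the format in the question.
# Note: batch prediction currently ignores task_type and defaults
# to RETRIEVAL_QUERY.
lines = [
    json.dumps({"content": t, "task_type": "RETRIEVAL_DOCUMENT"})
    for t in texts
]

with open("input.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```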
In addition, you can also check this GitHub repository as a baseline for your troubleshooting.
I hope the above information is helpful.