
Vertex AI Vector Search Index missing/dropping some documents

I am trying to embed a collection of documents into a Vector Search index on Vertex AI. The issue is that whenever I embed many documents at once, a subset of them are not embedded. I cannot figure out why they are not embedded, nor which documents have been skipped. This happens consistently: when I embed ~35 documents, around 7 get dropped; when I embed ~150 documents, around 15 get dropped.

I have created a GCS staging bucket, a Vertex AI index, and an endpoint, and have deployed the index to that endpoint. I created the index like so:

```
from google.cloud import aiplatform

# Create a Tree-AH index for approximate nearest-neighbor search.
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="my_index",
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="My index",
)
```
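
The endpoint setup and deployment step I mentioned above was along these lines (the display name and `deployed_index_id` here are placeholders, not my actual values):

```
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="my_index_endpoint",
    public_endpoint_enabled=True,
)

# Deploy the index to the endpoint so it can serve queries.
endpoint.deploy_index(
    index=index,
    deployed_index_id="my_deployed_index",
)
```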

I create a vector store like so:

```
from langchain_google_vertexai import VectorSearchVectorStore, VertexAIEmbeddings

# Connect the LangChain vector store to the existing index and endpoint,
# using the GCS bucket as the staging location for batch updates.
vectorstore = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=LOCATION,
    gcs_bucket_name=GCS_BUCKET,
    index_id=index_id,
    endpoint_id=endpoint_id,
    embedding=VertexAIEmbeddings(model_name="textembedding-gecko@003"),
)
```
 
I create a retriever like so:
```
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The retriever keeps summaries in the vector store and the full documents
# in the docstore, linked by the id_key metadata field.
retriever_multi_vector_img = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)
```
 
and add the documents like so:
```
retriever_multi_vector_img.vectorstore.add_documents(summary_documents)
```
 
where `summary_documents` is a list of `langchain_core.documents.Document` objects.
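
For context, the summary documents are built roughly like this, with the `id_key` metadata field linking each summary back to its parent document in the docstore (`doc_ids` and `summaries` are placeholder names for my own data):

```
from langchain_core.documents import Document

# Each summary carries the id_key metadata used by the MultiVectorRetriever
# to look up the full parent document in the docstore.
summary_documents = [
    Document(page_content=summary, metadata={id_key: doc_id})
    for doc_id, summary in zip(doc_ids, summaries)
]
```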
 
I tried this again with an empty GCS bucket, embedding 145 documents. Afterwards, the index reports a dense vector count of 132. Checking the staging bucket, I can see that all 145 documents are in the bucket, but the generated input file, documents.json, only contains 132 entries, matching the dense vector count of 132.
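
To narrow down which documents get dropped, this is the kind of check I am running: compare the IDs returned by `add_documents` against the IDs that end up in the generated input file. A rough sketch (it assumes documents.json is newline-delimited JSON with an "id" field per record, and the blob path is a guess at where the generated file lands in the bucket):

```
import json

from google.cloud import storage

# IDs the vector store reports back when the documents are added.
returned_ids = retriever_multi_vector_img.vectorstore.add_documents(summary_documents)

# Read the generated input file from the staging bucket.
client = storage.Client(project=PROJECT_ID)
blob = client.bucket(GCS_BUCKET).blob("documents.json")
indexed_ids = {
    json.loads(line)["id"]
    for line in blob.download_as_text().splitlines()
    if line.strip()
}

# Any ID returned by add_documents but missing from documents.json never made
# it into the index update.
dropped = set(returned_ids) - indexed_ids
print(f"submitted={len(returned_ids)} indexed={len(indexed_ids)} dropped={len(dropped)}")
```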
 
Could anyone help me figure out why this is happening? Thank you!
2 Replies

Hello,

Thank you for contacting Google Cloud Community!

I believe you are encountering inconsistent results when embedding documents into a Vertex AI Vector Search index, with a subset of documents not being embedded even though all of them are present in the GCS bucket.

  • Ensure your code includes proper error handling to catch exceptions during the embedding process, and log detailed error messages for failed embeddings.
  • Verify that you're not hitting Vertex AI API rate limits. If you are, consider implementing exponential backoff or retry logic in your code (see the sketch after this list).
  • Verify that the content of your documents is compatible with the embedding model. Some content types or languages might not be well supported.
  • Verify that the data in the documents.json file matches the original documents in your GCS bucket, and check for any discrepancies or errors in the data preparation process.
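
For instance, a minimal pattern for adding the documents in smaller batches with exponential backoff on transient API errors could look like the sketch below (the batch size, retry count, and exception types are illustrative, not recommended values). Comparing the number of returned IDs to the number of submitted documents after each run should make it clearer whether the drops happen at the embedding step or later in the index update:

```
import time

from google.api_core import exceptions as gax_exceptions

def add_documents_with_retry(vectorstore, documents, batch_size=50, max_retries=5):
    """Add documents in small batches, retrying transient API errors with backoff."""
    added_ids = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                added_ids.extend(vectorstore.add_documents(batch))
                break
            except (gax_exceptions.ResourceExhausted, gax_exceptions.ServiceUnavailable) as err:
                # Log the failure and back off exponentially before retrying.
                wait = 2 ** attempt
                print(f"Batch starting at {start} failed ({err}); retrying in {wait}s")
                time.sleep(wait)
        else:
            raise RuntimeError(f"Batch starting at {start} failed after {max_retries} retries")
    return added_ids
```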

Regards,
Jai Ade

Hello,

Thank you for your engagement regarding this issue. We haven't heard back from you for some time now, so I'm going to close this issue and it will no longer be monitored. However, if you run into any new issues, please don't hesitate to create a new one. We will be happy to assist you.

Regards,
Jai Ade