
Vertex AI Vector Search Index missing/dropping some documents

I am trying to embed a collection of documents and add them to a Vector Search index on Vertex AI. The issue is that whenever I embed many documents at once, a subset of them is not embedded. I cannot figure out why they are not embedded, nor which documents have been skipped. This issue occurs consistently: when I embed ~35 documents, around 7 get dropped; when I embed ~150 documents, around 15 get dropped.

I have created a GCS staging bucket, a Vertex AI index, and an endpoint, and have deployed the index to that endpoint. I created the index like so:

```
from google.cloud import aiplatform

# DIMENSIONS must match the output size of the embedding model.
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="my_index",
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="My index",
)
```

I create a vector store like so:

```
from langchain_google_vertexai import VectorSearchVectorStore, VertexAIEmbeddings

vectorstore = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=LOCATION,
    gcs_bucket_name=GCS_BUCKET,
    index_id=index_id,
    endpoint_id=endpoint_id,
    embedding=VertexAIEmbeddings(model_name="textembedding-gecko@003"),
)
```
 
I create a retriever like so:
```
from langchain.retrievers.multi_vector import MultiVectorRetriever

retriever_multi_vector_img = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)
```
 
and add the documents like so:
```
retriever_multi_vector_img.vectorstore.add_documents(summary_documents)
```
 
where `summary_documents` is a list of `langchain_core.documents.Document` objects.
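To rule out problems in the input itself, a quick check like the one below could flag empty or duplicate texts before `add_documents` is called. This is only a sketch. The `Doc` dataclass and the `audit_documents` helper are hypothetical stand-ins (only the `page_content` attribute of `langchain_core.documents.Document` is relied on), and I do not know whether empty or duplicate content is actually what causes the drops.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def audit_documents(docs):
    """Return (empty, duplicates): indices of docs with blank text, and
    indices of docs whose stripped text already appeared earlier in the list."""
    empty, duplicates, seen = [], [], set()
    for i, doc in enumerate(docs):
        text = doc.page_content.strip()
        if not text:
            empty.append(i)
        elif text in seen:
            duplicates.append(i)
        else:
            seen.add(text)
    return empty, duplicates


# Made-up sample input, for illustration only.
sample_docs = [Doc("summary a"), Doc("   "), Doc("summary a"), Doc("summary b")]
empty_idx, duplicate_idx = audit_documents(sample_docs)
print("empty:", empty_idx, "duplicates:", duplicate_idx)
```

Running the same function on `summary_documents` would show whether the number of flagged documents lines up with the number of dropped ones.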
 
I tried this again with an empty GCS bucket, and tried embedding 145 documents. After doing so, the index reports a dense vector count of 132. Checking the staging bucket, I can see that all 145 documents are in the bucket. But when I check the generated input file called "documents.json", I see that it only contains 132 entries, corresponding to the dense vector count of 132.
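One way to find out which documents were skipped is to diff the IDs I expected against the IDs that made it into "documents.json". The sketch below assumes the file is in JSON Lines format with one record per line and an `"id"` field per record, and that the expected IDs are available (for example from the return value of `add_documents`). The `find_missing_ids` helper and the sample data are hypothetical.

```python
import json


def find_missing_ids(expected_ids, jsonl_text):
    """Return the expected IDs that have no record in the JSON Lines text.

    Assumes each non-blank line is a JSON object with an "id" field.
    """
    indexed_ids = set()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        indexed_ids.add(str(record["id"]))
    return [doc_id for doc_id in expected_ids if str(doc_id) not in indexed_ids]


# Made-up sample standing in for the contents of documents.json.
sample_jsonl = "\n".join([
    json.dumps({"id": "doc-1", "embedding": [0.1, 0.2]}),
    json.dumps({"id": "doc-3", "embedding": [0.3, 0.4]}),
])
missing = find_missing_ids(["doc-1", "doc-2", "doc-3"], sample_jsonl)
print(missing)  # ['doc-2']
```

With the real file downloaded from the staging bucket and read into a string, the returned list would be the 13 IDs that are in the bucket but absent from the index input.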
 
Could anyone help me figure out why this is happening? Thank you!