Similarity search with Spanner

annalieb · 08-01-2024 11:18 AM

Hi all,

I am currently using Spanner as a vector database. I run similarity searches with ORDER BY COSINE_DISTANCE() to find the vector embeddings most similar to the search query.
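For context, an exact search of this kind has to score every stored embedding against the query before it can sort. Here is a minimal sketch of that work in plain Python, assuming the usual definition of cosine distance (1 minus cosine similarity). The toy rows and the exact_top_k helper are hypothetical stand-ins, not anything from Spanner itself.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, rows, k):
    """Score every row, sort by distance, keep the k nearest ids.

    This mirrors the shape of ORDER BY COSINE_DISTANCE(...) LIMIT k:
    the cost grows linearly with the number of stored embeddings.
    """
    scored = [(cosine_distance(query, emb), doc_id) for doc_id, emb in rows]
    scored.sort()
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy data standing in for the embeddings table.
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
print(exact_top_k([1.0, 0.0], rows, 2))  # ['a', 'c']
```

Because every row is scored, the amount of work scales with both the row count and the vector dimension, which is consistent with queries slowing down as the table grows.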

As I scale up my vector database to ~2 million embeddings, these queries take multiple minutes to execute. I have looked into the new vector indexing feature to get APPROX_COSINE_DISTANCE() results, but I cannot use it because it is in preview and my instance has less than 1 node (i.e., it is a granular instance).

I would love to hear thoughts on either of these: (1) Do you know how long it will take for the VECTOR INDEX feature to become available to all Spanner users (General Availability instead of Preview)? (2) Do you know of any other ways to speed up this search, other than reducing vector dimensions?

Any ideas welcome. Thanks!

kshenoy

ANN is not available on granular instances at the moment; a minimum of 1 node is needed. This is likely to change in the near future, so please stay tuned, and thanks for your patience.
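For anyone on an instance that does meet the requirement, the ANN setup looks roughly like the sketch below. It only builds the statement text in Python and does not talk to Spanner. The table, column, and index names, the @query parameter, and the option values (tree_depth, num_leaves, num_leaves_to_search) are all illustrative assumptions, and the syntax follows my reading of the vector index docs, so please check it against the current reference before using it.

```python
def create_vector_index_ddl(index, table, column, num_leaves):
    """Build a CREATE VECTOR INDEX statement (option values are assumptions)."""
    return (
        f"CREATE VECTOR INDEX {index}\n"
        f"  ON {table}({column})\n"
        f"  WHERE {column} IS NOT NULL\n"
        f"  OPTIONS (distance_type = 'COSINE', tree_depth = 2, "
        f"num_leaves = {num_leaves})"
    )

def approx_query_sql(index, table, column, k, leaves_to_search):
    """Build an approximate nearest-neighbor query that forces the index.

    DocId and @query are hypothetical: a key column and a query parameter
    holding the search embedding.
    """
    options = '{"num_leaves_to_search": %d}' % leaves_to_search
    return (
        f"SELECT DocId\n"
        f"FROM {table}@{{FORCE_INDEX={index}}}\n"
        f"WHERE {column} IS NOT NULL\n"
        f"ORDER BY APPROX_COSINE_DISTANCE(@query, {column}, "
        f"options => JSON '{options}')\n"
        f"LIMIT {k}"
    )

print(create_vector_index_ddl("DocEmbeddingIndex", "Documents", "DocEmbedding", 1000))
print(approx_query_sql("DocEmbeddingIndex", "Documents", "DocEmbedding", 10, 10))
```

As far as I understand it, searching more leaves trades speed for recall, so num_leaves_to_search is the knob to tune first.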

annalieb

Thanks for your response @kshenoy ! I appreciate the Google Cloud team's work on expanding ANN availability. I was hoping this was the case, since these features are pretty new. In the meantime I'll work on my own solution to help reduce load times.

Chow

Just a quick update on this thread. A couple of things to note:

Granular instances are now supported. Please give it a shot and let me know if you're hitting any issues.
ANN GA will happen soon.
One other thing you could try to improve performance is annotating your embedding column with vector_length. We've built optimizations that kick in for columns annotated with vector_length, and they may give you a useful boost.

Relevant docs: https://cloud.google.com/spanner/docs/reference/standard-sql/data-definition-language (search for "vector_length")
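To illustrate the vector_length suggestion, here is a small sketch that builds the DDL text for a table whose embedding column carries the annotation. It is a string builder only and does not connect to Spanner. The table and column names, the FLOAT32 element type, and the dimension of 768 are assumptions for the example; use whatever your embeddings actually are, and confirm the syntax against the DDL reference linked above.

```python
def embedding_column_ddl(column, dim):
    """Column definition with the vector_length annotation (FLOAT32 assumed)."""
    return f"{column} ARRAY<FLOAT32>(vector_length=>{dim})"

def create_table_ddl(table, dim):
    """CREATE TABLE text for a hypothetical table with one embedding column."""
    return (
        f"CREATE TABLE {table} (\n"
        f"  DocId INT64 NOT NULL,\n"
        f"  {embedding_column_ddl('DocEmbedding', dim)}\n"
        f") PRIMARY KEY (DocId)"
    )

print(create_table_ddl("Documents", 768))
```

This covers a new table only. Whether the annotation can be added to an existing column in place is something to check in the docs.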