Using BigQuery Embeddings in a RAG Architecture

Hi everyone,

I am working on building a Q&A system using the Gemini model in a RAG (Retrieval-Augmented Generation) architecture, leveraging BigQuery embeddings. However, I have multiple datasets and tables in BigQuery, as well as multiple kinds in Datastore.

I have a few key questions:

  1. Is it possible to use data from both BigQuery (BQ) and Datastore (kinds) together for building a Q&A system with Gemini?
  2. Would Datastore data need to be transformed before using it in a RAG pipeline? If so, what are the best practices?
  3. Are there any official documentation, best practices, or implementation examples for using both BigQuery and Datastore in a RAG-based Q&A system with Gemini?

Any insights, recommendations, or relevant documentation links would be greatly appreciated!

Thanks in advance!

Hi Nikita_G,

Welcome to the Google Cloud Community!

It seems like you're working on a sophisticated Q&A system, leveraging the Gemini model with a Retrieval-Augmented Generation (RAG) approach. Let me address your key questions one by one:

1. Is it possible to use data from both BigQuery (BQ) and Datastore (kinds) together for building a Q&A system with Gemini?

Yes, it is entirely possible to use data from both BigQuery and Datastore in a RAG architecture. This allows you to build a comprehensive Q&A system by drawing from different data sources.
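
One way to picture combining the two sources is to map records from each into a single document shape before any chunking or embedding happens. The following is only a minimal sketch under stated assumptions: `SourceDocument`, `from_bigquery_row`, and `from_datastore_entity` are hypothetical names invented for illustration, and plain dictionaries stand in for the rows and entities that the real client libraries would return.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class SourceDocument:
    """Hypothetical common shape for records from either source."""
    source: str   # "bigquery" or "datastore"
    doc_id: str   # identifier that stays unique across both sources
    text: str     # text that would later be chunked and embedded


def from_bigquery_row(table: str, row: Mapping[str, Any],
                      id_col: str, text_col: str) -> SourceDocument:
    # Assumption: `row` is a dict-like view of one BigQuery result row.
    return SourceDocument("bigquery", f"{table}:{row[id_col]}", str(row[text_col]))


def from_datastore_entity(kind: str, key_id: Any, props: Mapping[str, Any],
                          text_props: list[str]) -> SourceDocument:
    # Assumption: `props` is a dict-like view of one Datastore entity.
    # Only the listed text properties are kept; empty or missing ones are skipped.
    parts = [str(props[p]) for p in text_props if props.get(p)]
    return SourceDocument("datastore", f"{kind}:{key_id}", "\n".join(parts))


if __name__ == "__main__":
    corpus = [
        from_bigquery_row("sales.faq", {"id": 7, "answer": "Refunds take 5 days."},
                          id_col="id", text_col="answer"),
        from_datastore_entity("Article", 42,
                              {"title": "Returns", "body": "Send items back within 30 days."},
                              text_props=["title", "body"]),
    ]
    for doc in corpus:
        print(doc.source, doc.doc_id, repr(doc.text))
```

Prefixing each identifier with its table or kind keeps the combined corpus traceable, so a retrieved passage can be attributed to the source it came from.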

2. Would Datastore data need to be transformed before using it in a RAG pipeline? If so, what are the best practices? 

Datastore data generally needs transformation before it can be used effectively in a RAG pipeline. The key steps in this process include extracting the relevant text data from your Datastore entities and applying cleaning and normalization techniques, such as removing HTML tags and converting text to lowercase. If your text extracts are very large, you may need to split them into smaller chunks, although this step is optional and depends on the document size and the input limits of your model. The next step involves converting the text chunks into numerical vector representations using an embedding model. Finally, you will need to build a vector index for efficient similarity search based on the generated embeddings.
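
The steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a production pipeline: `toy_embed` is a word-count placeholder standing in for a real embedding model, the brute-force `top_k` scan stands in for a real vector index, and all function names and the chunk sizes are hypothetical.

```python
import html
import math
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Strip HTML tags, unescape entities, collapse whitespace, lowercase."""
    no_tags = TAG_RE.sub(" ", raw)
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip().lower()


def chunk_words(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of at most max_words words, overlapping by `overlap`."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Placeholder embedding (word counts). A real pipeline would call an embedding model."""
    return Counter(text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Brute-force similarity search; a vector index would replace this scan at scale."""
    q = toy_embed(clean_text(query))
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


if __name__ == "__main__":
    raw = "<p>Refund policy: items can be <b>returned</b> within 30 days.</p>"
    chunks = chunk_words(clean_text(raw), max_words=6, overlap=2)
    for score, chunk in top_k("refund policy", chunks, k=2):
        print(round(score, 3), chunk)
```

The overlap between chunks is a design choice: it reduces the chance that a sentence split across a boundary becomes unretrievable, at the cost of storing some text twice.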

3. Are there any official documentation, best practices, or implementation examples for using both BigQuery and Datastore in a RAG-based Q&A system with Gemini? 

Currently, there isn't a single, official Google Cloud guide or tutorial that explicitly demonstrates building a Retrieval-Augmented Generation (RAG) Q&A system using both BigQuery and Datastore as knowledge sources with Gemini. Instead, you'll need to combine insights and techniques from various Google Cloud and Gemini resources to create your solution. Here are some key resources and best practices that you can use to guide your implementation:

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Thanks for the insights. I'm trying to understand the best approach for generating embeddings from data stored in Datastore.

My understanding is that Datastore itself doesn't offer native vector search capabilities like BigQuery ML or dedicated vector databases. This makes me question whether we can directly use Datastore data for embedding-based retrieval. Is this correct, or is there a way to perform vector search directly within Datastore?

I know that BigQuery allows us to create external connections and integrate with Vertex AI models to generate text embeddings using SQL. However, I haven't found any documentation on directly connecting Datastore to Vertex AI for embedding generation.

Therefore, I'm wondering if the only viable option is to transfer the Datastore data into BigQuery first, and then generate the embeddings within BigQuery. Could you please confirm if this is the recommended approach, or if there are alternative methods I should consider?

Thanks!