Solved: Using BigQuery Embeddings in a RAG Architecture

Nikita_G · 01-29-2025 09:15 PM

Hi everyone,

I am working on building a Q&A system using the Gemini model in a RAG (Retrieval-Augmented Generation) architecture, leveraging BigQuery embeddings. However, I have multiple datasets and tables in BigQuery, as well as multiple kinds in Datastore.

I have a few key questions:

1. Is it possible to use data from both BigQuery (BQ) and Datastore (kinds) together for building a Q&A system with Gemini?
2. Would Datastore data need to be transformed before using it in a RAG pipeline? If so, what are the best practices?
3. Are there any official documentation, best practices, or implementation examples for using both BigQuery and Datastore in a RAG-based Q&A system with Gemini?

Any insights, recommendations, or relevant documentation links would be greatly appreciated!

Thanks in advance!

nikacalupas

Hi Nikita_G,

Welcome to the Google Cloud Community!

It seems like you're working on a sophisticated Q&A system, leveraging the Gemini model with a Retrieval-Augmented Generation (RAG) approach. Let me address your key questions one by one:

1. Is it possible to use data from both BigQuery (BQ) and Datastore (kinds) together for building a Q&A system with Gemini?

Yes, it is entirely possible to use data from both BigQuery and Datastore in a RAG architecture. This allows you to build a comprehensive Q&A system by drawing from different data sources.
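
One way to picture combining the two sources is to map both into a shared record shape before embedding. The sketch below is a minimal illustration, not an official pattern: it assumes rows and entities have already been fetched and are represented as plain dicts, and the `Document` class, the helper functions, and the field names (`order_id`, `notes`, `question`, `answer`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class Document:
    """One retrievable unit of text, tagged with where it came from."""
    doc_id: str
    source: str  # "bigquery" or "datastore"
    text: str


def from_bigquery_rows(table: str, rows: Iterable[dict[str, Any]],
                       id_field: str, text_field: str) -> list[Document]:
    """Map BigQuery rows (assumed already fetched as dicts) to Documents."""
    return [
        Document(f"bq:{table}:{row[id_field]}", "bigquery", str(row[text_field]))
        for row in rows
        if row.get(text_field)  # skip rows with no usable text
    ]


def from_datastore_entities(kind: str, entities: Iterable[dict[str, Any]],
                            key_field: str, text_fields: list[str]) -> list[Document]:
    """Map Datastore entities (assumed already fetched as dicts) to Documents.

    Entities of the same kind may lack some properties, so empty or
    missing fields are skipped rather than treated as errors.
    """
    docs = []
    for entity in entities:
        parts = [str(entity[f]) for f in text_fields if entity.get(f)]
        if parts:
            docs.append(Document(f"ds:{kind}:{entity[key_field]}",
                                 "datastore", "\n".join(parts)))
    return docs


# Hypothetical sample data standing in for real query results.
bq_rows = [{"order_id": 1, "notes": "Refund issued after late delivery."}]
ds_entities = [
    {"id": "faq-7", "question": "How do refunds work?", "answer": "Refunds take 5 days."},
    {"id": "faq-8", "question": "", "answer": None},  # nothing usable, dropped
]

corpus = (from_bigquery_rows("orders", bq_rows, "order_id", "notes")
          + from_datastore_entities("Faq", ds_entities, "id", ["question", "answer"]))
```

Keeping a source-prefixed ID on each record makes it possible to trace a retrieved passage back to its table or kind when citing it in an answer.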

2. Would Datastore data need to be transformed before using it in a RAG pipeline? If so, what are the best practices?

Datastore data generally needs to be transformed before it can be used effectively in a RAG pipeline. First, extract the relevant text from your Datastore entities and clean and normalize it, for example by removing HTML tags and converting text to lowercase. If the extracted text is very large, split it into smaller chunks; this step is optional and depends on the document size and the input limits of your model. Next, convert the text chunks into numerical vector representations (embeddings) using an embedding model. Finally, build a vector index over the generated embeddings for efficient similarity search.
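
The steps above can be sketched end to end with the standard library only. This is a rough illustration under loud assumptions: entities are plain dicts with a hypothetical `body` property, `toy_embed` is a word-count stand-in for a real embedding model (it is not how Gemini or BigQuery embeddings work), and the linear scan in `search` stands in for a real vector index.

```python
import html
import math
import re
from collections import Counter


def clean(text: str) -> str:
    """Strip HTML tags, decode entities, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip().lower()


def chunk(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows that overlap by `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words) - overlap, step)]


def toy_embed(text: str) -> dict[str, float]:
    """ASSUMPTION: stand-in for a real embedding model.

    Returns an L2-normalized sparse word-count vector, so it only
    captures exact word overlap, not meaning.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def search(query: str, index: list[tuple[str, dict[str, float]]], k: int = 2) -> list[str]:
    """Brute-force nearest neighbours; a vector index would replace this scan."""
    q = toy_embed(clean(query))
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical entities with a text property named "body".
entities = [
    {"body": "<p>Refunds are processed within 5 business days.</p>"},
    {"body": "<p>Shipping is free for orders over &pound;50.</p>"},
]
index = [(c, toy_embed(c)) for e in entities for c in chunk(clean(e["body"]))]
top = search("How long do refunds take?", index, k=1)
```

The retrieved chunks in `top` are what would be passed to the model as context alongside the user's question. Overlapping chunks are one common way to avoid cutting a relevant sentence in half at a chunk boundary; the window sizes here are arbitrary.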

3. Are there any official documentation, best practices, or implementation examples for using both BigQuery and Datastore in a RAG-based Q&A system with Gemini?

Currently, there isn't a single official Google Cloud guide or tutorial that explicitly demonstrates building a RAG Q&A system using both BigQuery and Datastore as knowledge sources with Gemini. Instead, you'll need to combine insights and techniques from various Google Cloud and Gemini resources to create your solution. Here are some key resources and best practices that you can use to guide your implementation:

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
