Evaluating RAG Question Answer bots with Vertex AI and Galileo Evaluate

Authors:
Vatsal Goel,  Applied Data Scientist, Galileo
Tai Conley, Partner Engineering, Google

Overview

While the development of Generative AI is progressing rapidly, evaluating performance of these systems, especially in complex tasks like question-answering remains a significant challenge. This project demonstrates how to address this challenge through building and evaluating multiple Retrieval-Augmented Generation (RAG) Question-Answering bots using Vertex AI Model Garden and Galileo Evaluate.

Model Garden on Vertex AI is a single destination offering a selection of high-performing foundation models (FMs) from leading AI companies like Anthropic, Cohere, Meta, Stability AI, and Google through a single API. Vertex AI is a platform providing a broad set of capabilities to build predictive and generative AI applications on Google Cloud with security, privacy, and scalability in mind. 

Galileo Observe is a flexible module built on top of Galileo’s Evaluation Intelligence Platform. The module is used for monitoring LLM outputs in production, or ‘online evaluation’. For detecting common hallucinations that measure things like factuality, context adherence, toxicity, PII leakage, and bias, it leverages Luna, a suite of proprietary metrics and research-backed Evaluation Foundation Models. The Galileo LLM studio is also available on the Google Cloud Marketplace here.

Solution overview

We use a sample use case to illustrate the process by building a QA bot designed to answer questions based on a context derived from Wikipedia articles about Formula 1.

The workflow includes the following steps:

  1. Load the data from your data source (eg. wikipedia in our case).
  2. Chunk the data as you would for your RAG application.
  3. Generate embeddings for each chunk and store them in a vector database.
  4. Define a list of questions for the bot.
  5. Define a list of models to be evaluated.
  6. Extract the relevant context from the vector database that answers the question.
  7. For each model, prompt the LLM with the question and context to get an answer.
  8. Log and upload the workflow for question -> context -> answer to Galileo Evaluate.

We're creating a Question-Answering bot about Formula 1 using Wikipedia articles. But here's the twist: we're making three versions, each powered by a different Large Language Model (LLM). Then, we'll put them to the test and see how they stack up.

For the orchestration and automation steps in this process, we use LangChain. LangChain is an open source Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.

The next sections walk you through the most important parts of the process. If you want to dive deeper and run it yourself, refer to the notebook at galileo_eval_model_garden.ipynb.

Prerequisites

Before we jump in, make sure you have:

  1. A Google Cloud Platform (GCP) account and project
  2. Vertex AI API enabled
  3. A Vertex AI Workbench notebook

For more information you can refer to the documentation here - Choose a notebook solution | Vertex AI Workbench | Google Cloud

Now to follow along just launch a notebook in the studio, you can select the default kernel, we are not going to be doing any computationally heavy operations.

Load and prepare the data

Document Retrieval

First, let’s load some Formula 1 docs using LangChain’s WikipediaLoader. 

taiconley_0-1730318667779.png

Document Processing

Next we use the CharacterTextSplitter to split the PDF documents into chunks. The CharacterTextSplitter divides the text into chunks of a specified size while trying to preserve context and meaning of the content. It’s a good way to start when working with text-based documents. You don’t have to split your documents to create your evaluation dataset if your LLM supports a context window that is large enough to fit your documents, but you could potentially end up with a lower quality of generated questions due to the larger size of the task.

You can also play around with the chunk size and see how it impacts your LLM’s performance with the help of the chunk utilization percentage metric in Galileo Evaluate.

taiconley_1-1730318667756.png

 

 

Next we compute embeddings for the chunks using Vertex AI’s text embedding model and then store these embeddings in FAISS, a vector database, for efficient retrieval

taiconley_2-1730318667778.png

 

LLM Setup and Question Definition

To facilitate prompting the LLM using the Vertex SDK and LangChain, we initialize a client to generate a response.

taiconley_3-1730318667789.png

 

 

 

 

Next we list down the models we wish to evaluate. For this lab we will be using 3 models from the Google Cloud Model Garden. Feel free to change them as needed and play around, make sure you have the quota and permissions to run them.

taiconley_4-1730318667775.png

 

 

 

 

 

Now we define some questions we wish to ask the LLM. The first 8 are relevant to the context, the next 2 are irrelevant to change metrics, while the last 2 are prompt injection attacks. Feel free to change them as needed and play around.

taiconley_5-1730318667796.png

 

 

 

 

Extract relevant context

To ensure the LLM has sufficient information to answer each question, we use the FAISS vector database to extract the relevant context sentences from the given question. Knowing the relevant sentences, you can check whether the question and answer are correct.

taiconley_6-1730318667753.png

 

 

We also log this step to Galileo Evaluate.

taiconley_7-1730318667763.png

 

 

You can see that for each question, we fetch the TOP 3  closest chunks from the vector database. You can use the chunk-attribution metric in Galileo Evaluate!

Generate answers and log to Galileo Evaluate

Now comes the main part where we pass thes retrieved documents as context, along with the question, to Google Cloud Model Garden LLMs and generate answers based on the provided context and question.

taiconley_8-1730318667794.png

 

 

 

 

 

 

 

You can have a look at the final results (screenshot below) in the console. For a feel of the console and its offerings check out our video here.

taiconley_9-1730318667797.png

 

 

 

 

In the screenshot above, you can see a bunch of metrics displayed. Let’s have a look at some of these metrics - 

  • Context Adherence - Context Adherence is a measurement of closed-domain hallucinations: cases where your model said things that were not provided in the context.
  • Completeness - This measures how thoroughly your model’s response covered the relevant information available in the context provided.
  • Chunk Attribution - For each chunk retrieved in a RAG pipeline, Chunk Attribution measures whether or not that chunk had an effect on the model’s response. This metric helps you tune your TOP K chunks retrieved during a RAG cycle.
  • Chunk Utilization - For each chunk retrieved in a RAG pipeline, Chunk Utilization measures the fraction of the text in that chunk that had an impact on the model’s response. This metric helps you tune your chunk size which is stored in the vector database.
  • Chunk Relevance - For each chunk retrieved in a RAG pipeline, Chunk Relevance detects the sections of the text that contain useful information to address the query.

To read more about the metrics you can refer to Galileo’s docs.

All of the above metrics are computed by using Galileo’s in-house small language models (Luna). The model is trained on carefully curated RAG datasets and optimized to closely align with the RAG Plus metrics. You can read more about the model here - Luna: An Evaluation Foundation Model

Conclusion

We have explored the process of creating and evaluating a chatbot for a QA-RAG application using Google Cloud's Model Garden via the Vertex AI API, Python, and Langchain. We covered essential steps, including setting up the environment, loading and preparing context data, extracting relevant context, answer generation, and logging to Galileo.

Contributors
Comments
tomcannon
Staff

Great post @taiconley and @vatsal_gl !

Version history
Last update:
‎10-31-2024 08:48 AM
Updated by: