While the development of Generative AI is progressing rapidly, evaluating performance of these systems, especially in complex tasks like question-answering remains a significant challenge. This project demonstrates how to address this challenge through building and evaluating multiple Retrieval-Augmented Generation (RAG) Question-Answering bots using Vertex AI Model Garden and Galileo Evaluate.
Model Garden on Vertex AI is a single destination offering a selection of high-performing foundation models (FMs) from leading AI companies like Anthropic, Cohere, Meta, Stability AI, and Google through a single API. Vertex AI is a platform providing a broad set of capabilities to build predictive and generative AI applications on Google Cloud with security, privacy, and scalability in mind.
Galileo Observe is a flexible module built on top of Galileo’s Evaluation Intelligence Platform. The module is used for monitoring LLM outputs in production, or ‘online evaluation’. For detecting common hallucinations that measure things like factuality, context adherence, toxicity, PII leakage, and bias, it leverages Luna, a suite of proprietary metrics and research-backed Evaluation Foundation Models. The Galileo LLM studio is also available on the Google Cloud Marketplace here.
Solution overview
We use a sample use case to illustrate the process by building a QA bot designed to answer questions based on a context derived from Wikipedia articles about Formula 1.
The workflow includes the following steps:
We're creating a Question-Answering bot about Formula 1 using Wikipedia articles. But here's the twist: we're making three versions, each powered by a different Large Language Model (LLM). Then, we'll put them to the test and see how they stack up.
For the orchestration and automation steps in this process, we use LangChain. LangChain is an open source Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.
The next sections walk you through the most important parts of the process. If you want to dive deeper and run it yourself, refer to the notebook at galileo_eval_model_garden.ipynb.
Before we jump in, make sure you have:
For more information you can refer to the documentation here - Choose a notebook solution | Vertex AI Workbench | Google Cloud
Now to follow along just launch a notebook in the studio, you can select the default kernel, we are not going to be doing any computationally heavy operations.
First, let’s load some Formula 1 docs using LangChain’s WikipediaLoader.
Next we use the CharacterTextSplitter to split the PDF documents into chunks. The CharacterTextSplitter divides the text into chunks of a specified size while trying to preserve context and meaning of the content. It’s a good way to start when working with text-based documents. You don’t have to split your documents to create your evaluation dataset if your LLM supports a context window that is large enough to fit your documents, but you could potentially end up with a lower quality of generated questions due to the larger size of the task.
You can also play around with the chunk size and see how it impacts your LLM’s performance with the help of the chunk utilization percentage metric in Galileo Evaluate.
Next we compute embeddings for the chunks using Vertex AI’s text embedding model and then store these embeddings in FAISS, a vector database, for efficient retrieval
To facilitate prompting the LLM using the Vertex SDK and LangChain, we initialize a client to generate a response.
Next we list down the models we wish to evaluate. For this lab we will be using 3 models from the Google Cloud Model Garden. Feel free to change them as needed and play around, make sure you have the quota and permissions to run them.
Now we define some questions we wish to ask the LLM. The first 8 are relevant to the context, the next 2 are irrelevant to change metrics, while the last 2 are prompt injection attacks. Feel free to change them as needed and play around.
To ensure the LLM has sufficient information to answer each question, we use the FAISS vector database to extract the relevant context sentences from the given question. Knowing the relevant sentences, you can check whether the question and answer are correct.
We also log this step to Galileo Evaluate.
You can see that for each question, we fetch the TOP 3 closest chunks from the vector database. You can use the chunk-attribution metric in Galileo Evaluate!
Now comes the main part where we pass thes retrieved documents as context, along with the question, to Google Cloud Model Garden LLMs and generate answers based on the provided context and question.
You can have a look at the final results (screenshot below) in the console. For a feel of the console and its offerings check out our video here.
In the screenshot above, you can see a bunch of metrics displayed. Let’s have a look at some of these metrics -
To read more about the metrics you can refer to Galileo’s docs.
All of the above metrics are computed by using Galileo’s in-house small language models (Luna). The model is trained on carefully curated RAG datasets and optimized to closely align with the RAG Plus metrics. You can read more about the model here - Luna: An Evaluation Foundation Model
We have explored the process of creating and evaluating a chatbot for a QA-RAG application using Google Cloud's Model Garden via the Vertex AI API, Python, and Langchain. We covered essential steps, including setting up the environment, loading and preparing context data, extracting relevant context, answer generation, and logging to Galileo.
Great post @taiconley and @vatsal_gl !