In our previous post, we explored why trajectory evaluation and handoffs are critical for multi-agent systems. Now, we'll introduce agent evaluation in the context of A2A (Agent2Agent) interactions through a simple, practical example.
In our forthcoming posts, we will delve deeper into this topic, focusing on the mechanics of handoffs and the important distinction between agents and tools within a trajectory.
This post demonstrates how to evaluate an A2A Reimbursement Agent running in-memory using the Vertex AI Gen AI Evaluation service.
Prerequisites:
- Colab authentication: run `from google.colab import auth` and `auth.authenticate_user()`.
- An A2A Agent Executor (the `ReimbursementAgentExecutor` class) must be defined or importable within this notebook. This executor should have a method like `async def execute(self, message_payload: a2a.types.MessagePayload) -> a2a.types.Message:`.
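In a Colab environment, the authentication step is just:

```python
# Authenticate the Colab runtime so it can call Google Cloud services.
from google.colab import auth

auth.authenticate_user()
```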
This post contains two sections: A. Agent setup and B. Evaluation.
A. Agent setup
1. Setup and Installs
We will be leveraging the A2A (Agent2Agent) and ADK (Agent Development Kit) Python SDKs in this tutorial. To learn more about the Agent2Agent Protocol, you can review the official A2A documentation.
!pip install google-cloud-aiplatform httpx "a2a-sdk==0.2.5" --quiet
!pip install --upgrade --quiet 'google-adk==1.2.0'
Follow the tutorial for imports and configurations.
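For reference, the configuration typically looks like the snippet below. The project ID, region, bucket, and experiment name are placeholders you would replace with your own values.

```python
import vertexai

# Placeholder values; replace with your own project, region, and bucket.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
BUCKET_URI = "gs://your-bucket-name"
EXPERIMENT_NAME = "evaluate-a2a-agent"

vertexai.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)
```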
2. Defining the Reimbursement Agent
This section defines an AI-powered Reimbursement Agent using Google's Agent Development Kit (ADK). Its purpose is to automate the process of handling employee reimbursement requests through a conversational interface.
The agent has access to the following tools: `create_request_form`, `return_form`, and `reimburse`.
The `ReimbursementAgent` class orchestrates the entire process.
**`__init__(self)`**: The constructor initializes the agent. It builds the agent's logic by calling `_build_agent` and sets up a `Runner`. The runner is the engine that executes the agent's tasks, using in-memory services for sessions and artifacts, which means it doesn't need an external database for this example.
**`_build_agent(self)`**: This is the core of the agent's definition.
- Model: It specifies `gemini-2.0-flash-001` as the LLM.
- Instructions: It provides a detailed, multi-step prompt that tells the LLM exactly how to behave.
**`stream(self, query, session_id)`**: This asynchronous method is the entry point for handling user input. It manages the conversation session and streams responses back. As the agent processes a request, it can `yield` intermediate updates (e.g., "Processing the reimbursement request...") before sending the final, structured response. A condensed sketch of the class follows.
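The full class lives in the accompanying notebook; below is a condensed, illustrative sketch of the same pattern against ADK 1.2.0. The tool bodies, the instruction text, and the `stream` payload shapes are simplified stand-ins, not the notebook's exact code.

```python
from google.adk.agents import LlmAgent
from google.adk.artifacts import InMemoryArtifactService
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types


# Illustrative tool stubs; the real tools carry the form-handling logic.
def create_request_form(date: str, amount: str, purpose: str) -> dict:
    """Creates a new reimbursement request form."""
    return {"request_id": "req-1", "date": date, "amount": amount, "purpose": purpose}


def return_form(form: dict) -> dict:
    """Returns the (possibly incomplete) form to the user to fill in."""
    return form


def reimburse(request_id: str) -> dict:
    """Approves the reimbursement request."""
    return {"request_id": request_id, "status": "approved"}


class ReimbursementAgent:
    SUPPORTED_CONTENT_TYPES = ["text", "text/plain"]

    def __init__(self):
        self._agent = self._build_agent()
        self._user_id = "remote_agent"
        # In-memory services: no external database is needed for this example.
        self._runner = Runner(
            app_name="reimbursement_agent",
            agent=self._agent,
            session_service=InMemorySessionService(),
            artifact_service=InMemoryArtifactService(),
        )

    def _build_agent(self) -> LlmAgent:
        return LlmAgent(
            model="gemini-2.0-flash-001",
            name="reimbursement_agent",
            instruction=(
                "Handle employee reimbursement requests: create a request form, "
                "return it to the user if details are missing, then reimburse."
            ),  # abbreviated; the notebook uses a detailed multi-step prompt
            tools=[create_request_form, return_form, reimburse],
        )

    async def stream(self, query: str, session_id: str):
        # Reuse the session if it exists; otherwise create it.
        session = await self._runner.session_service.get_session(
            app_name=self._runner.app_name, user_id=self._user_id, session_id=session_id
        ) or await self._runner.session_service.create_session(
            app_name=self._runner.app_name, user_id=self._user_id, session_id=session_id
        )
        content = types.Content(role="user", parts=[types.Part(text=query)])
        async for event in self._runner.run_async(
            user_id=self._user_id, session_id=session.id, new_message=content
        ):
            if event.is_final_response():
                yield {"is_task_complete": True, "content": event.content.parts[0].text}
            else:
                yield {"is_task_complete": False,
                       "updates": "Processing the reimbursement request..."}
```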
3. Defining the Agent Executor
The code defines an Agent Executor, a component that acts as a controller for the `ReimbursementAgent` you saw previously.
The `ReimbursementAgent` is the specialized worker that knows the steps to handle a reimbursement (create form, check details, approve). The `ReimbursementAgentExecutor` is the supervisor that gives the agent the user's request, manages the task's lifecycle, and translates the agent's work into status updates for the larger system.
The executor's main job is to run the agent and manage the communication flow.
**`__init__(self)`**: When the executor is created, it immediately creates an instance of the `ReimbursementAgent`, holding it internally.
**`execute(...)`**: This is the primary method and contains the main operational logic.
**`cancel(...)`**: This method is explicitly not implemented. If a cancellation is requested, it raises an `UnsupportedOperationError`, making it clear that this feature is not available in this executor.
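Schematically, and assuming the `a2a-sdk` 0.2.5 interfaces installed above, the executor looks roughly like this; it is a sketch, not the notebook's exact code, and the event-translation logic inside `execute` is elided.

```python
from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.events import EventQueue
from a2a.types import UnsupportedOperationError
from a2a.utils.errors import ServerError


class ReimbursementAgentExecutor(AgentExecutor):
    def __init__(self):
        # Create and hold the specialized worker internally.
        self.agent = ReimbursementAgent()

    async def execute(self, context: RequestContext, event_queue: EventQueue) -> None:
        query = context.get_user_input()
        # Stream the agent's work and translate each update into task
        # status/artifact events on the event queue (details omitted here).
        async for item in self.agent.stream(query, context.context_id):
            ...

    async def cancel(self, context: RequestContext, event_queue: EventQueue) -> None:
        # Cancellation is intentionally unsupported in this executor.
        raise ServerError(error=UnsupportedOperationError())
```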
4. Defining the A2A server
Next, you need to define an A2A (Agent2Agent) server to expose the `ReimbursementAgent` and its `Executor` to the outside world. The server acts as a web endpoint, allowing other applications, like a chat app or a web browser, to communicate with your agent over a network.
from a2a.server.apps import A2AStarletteApplication
from a2a.server.request_handlers import DefaultRequestHandler
from a2a.server.tasks import InMemoryTaskStore
from a2a.types import AgentCapabilities, AgentCard, AgentSkill

capabilities = AgentCapabilities(streaming=True)
skill = AgentSkill(
    id='process_reimbursement',
    name='Process Reimbursement Tool',
    description='Helps with the reimbursement process for users given the amount and purpose of the reimbursement.',
    tags=['reimbursement'],
    examples=[
        'Can you reimburse me $20 for my lunch with the clients?'
    ],
)
agent_card = AgentCard(
    name='Reimbursement Agent',
    description='This agent handles the reimbursement process for the employees given the amount and purpose of the reimbursement.',
    url='http://localhost/agent',  # Placeholder, not used by TestClient
    version='1.0.0',
    defaultInputModes=ReimbursementAgent.SUPPORTED_CONTENT_TYPES,
    defaultOutputModes=ReimbursementAgent.SUPPORTED_CONTENT_TYPES,
    capabilities=capabilities,
    skills=[skill],
)
request_handler = DefaultRequestHandler(
    agent_executor=ReimbursementAgentExecutor(),
    task_store=InMemoryTaskStore(),
)
server = A2AStarletteApplication(
    agent_card=agent_card, http_handler=request_handler
)

# Build the Starlette ASGI app.
# This `starlette_app` can be served by Uvicorn or used with TestClient.
expense_starlette_app = server.build()
5. Test Client
This code uses a `TestClient` to simulate a client application (like a chat app) talking to your agent server. It runs two main tests to ensure the server is working correctly: (1) getting the agent card, and (2) simulating a user sending a message to the agent to start a task.
import json
import logging

from starlette.testclient import TestClient

# Basic logging setup (helpful for seeing what the handler does)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# TestClient should be used as a context manager or closed explicitly
with TestClient(expense_starlette_app) as client:
    logger.info("\n--- Test 1: Get Agent Card ---")
    response = client.get("/.well-known/agent.json")
    assert response.status_code == 200
    agent_card_data = response.json()
    print(f"SUCCESS: Agent Card received: {agent_card_data['name']}")
    print("A2AClient initialized.")

    print("\n--- Test 2: Non-streaming RPC - message/send ---")
    message_id_send = "colab-msg-007"
    rpc_request_send_msg = {
        "jsonrpc": "2.0",
        "id": "colab-req-send-msg-1",
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": "Hello Agent, Please reimburse me $20 for my lunch with the clients on 06/01/2025?"}],
                "messageId": message_id_send,
                "kind": "message",
                "contextId": "colab-session-xyz",
            }
        },
    }
    response = client.post("/", json=rpc_request_send_msg)
    assert response.status_code == 200
    rpc_response_send_msg = response.json()
    print(f"message/send response: {json.dumps(rpc_response_send_msg, indent=2)}")
    print(f"SUCCESS: message/send for '{message_id_send}' passed.")
B. Evaluation
1. Imports and helper functions
Import packages and define helper functions as shown in the Colab notebook.
**`parse_a2a_output_to_dictionary()`** and **`parse_adk_output_to_dictionary()`** take the raw JSON or event output and extract the most important information: the agent's final text response and the sequence of tools it called (its "trajectory"). This ties back to our earlier discussion, and to our previous post, about evaluating trajectories. A simplified sketch of the A2A parser follows.
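As a rough, hypothetical sketch (the notebook's version handles more response shapes), the A2A parser pulls the text parts out of the `message/send` result and returns the column names the evaluation service expects:

```python
def parse_a2a_output_to_dictionary(rpc_response: dict) -> dict:
    """Extracts the final response text from a message/send JSON-RPC response."""
    result = rpc_response.get("result", {})
    # A completed task carries its final message under status.message;
    # a plain message result carries the parts directly.
    message = result.get("status", {}).get("message", result)
    final_text = " ".join(
        part.get("text", "")
        for part in message.get("parts", [])
        if part.get("kind") == "text"
    )
    return {
        "response": final_text,
        # The tool-call trajectory is not recovered in this simplified sketch.
        "predicted_trajectory": [],
    }
```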
2. Assembling the agents
def a2a_parsed_outcome(query):
    # TestClient should be used as a context manager or closed explicitly
    with TestClient(expense_starlette_app) as client:
        print("\n--- Get Agent Card ---")
        response = client.get("/.well-known/agent.json")
        assert response.status_code == 200
        agent_card_data = response.json()
        print(f"--- SUCCESS: Agent Card received: {agent_card_data['name']} ---")
        print("--- A2AClient initialized. ---")
        print(f"Query: {query}")

        message_id_send = f"colab-msg-{get_id()}"
        rpc_request_send_msg = {
            "jsonrpc": "2.0",
            "id": f"colab-req-send-msg-{get_id()}",
            "method": "message/send",
            "params": {
                "message": {
                    "role": "user",
                    "parts": [{"kind": "text", "text": query}],
                    "messageId": message_id_send,
                    "kind": "message",
                    "contextId": "colab-session-xyz",
                }
            },
        }
        response = client.post("/", json=rpc_request_send_msg)
        assert response.status_code == 200
        rpc_response_send_msg = response.json()
        print(f"SUCCESS: message/send for '{message_id_send}' Finished")
        return parse_a2a_output_to_dictionary(rpc_response_send_msg)
With this setup, let's query the agent with some quick examples.
response = a2a_parsed_outcome(query="Get product details for shoes")
display(Markdown(format_output_as_markdown(response)))
response = a2a_parsed_outcome(query="Hello Agent, Please reimburse me $20 for my lunch with the clients on 06/01/2025?")
display(Markdown(format_output_as_markdown(response)))
response = a2a_parsed_outcome(query="Hello Agent, Please reimburse me $311 for my flights from SFO to SEA on 06/11/2025?")
display(Markdown(format_output_as_markdown(response)))
response = a2a_parsed_outcome(query="Hello Agent, Please reimburse me $50 for my lunch with the clients on Jan 2nd,2024?")
display(Markdown(format_output_as_markdown(response)))
3. Prepare agent evaluation dataset
To evaluate your AI agent using the Vertex AI Gen AI Evaluation service, you need a specific dataset depending on which aspects of your agent you want to evaluate.
This dataset should include the prompts given to the agent. It can also contain the ideal or expected response (ground truth) and the intended sequence of tool calls the agent should take (reference trajectory), i.e., the sequence of tools you expect the agent to call for each prompt.
Optionally, you can provide both generated responses and predicted trajectories (the Bring-Your-Own-Dataset scenario).
Below is an example dataset for the Reimbursement Agent with user prompts and (empty) reference trajectories.
#@title Define eval datasets
# The reference trajectories are empty in this example.
eval_data_a2a = {
    "prompt": [
        "Get product details for shoes",
        "Hello Agent, Please reimburse me $20 for my lunch with the clients on 06/01/2025?",
        "Hello Agent, Please reimburse me $20 for my lunch with the clients",
        "Please reimburse me $312 for my meal with the clients on 06/05/2025?",
        "Please reimburse me $1234 for my flight to Seattle on 06/11/2025?",
    ],
    "reference_trajectory": [
        [], [], [], [], [],
    ],
}
eval_sample_dataset = pd.DataFrame(eval_data_a2a)
display_dataframe_rows(eval_sample_dataset, num_rows=30)
4. Evaluate final response
# EvalTask may already be imported in the setup step; the preview path is
# shown here, and it can differ by google-cloud-aiplatform version.
from vertexai.preview.evaluation import EvalTask

# EXPERIMENT_NAME, BUCKET_URI, and response_metrics come from the notebook's
# earlier configuration steps, e.g. response_metrics = ["coherence", "safety"].
EXPERIMENT_RUN = f"response-{get_id()}"

response_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=response_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/response-metric-eval",
)

response_eval_result = response_eval_task.evaluate(
    runnable=a2a_parsed_outcome, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(response_eval_result)
Sample Output
Feel free to check out the A2A project for more information, and do read our thought leadership on Why Agents are not Tools.