In our previous post, we explored why trajectory evaluation and handoffs are critical for multi-agent systems. Now, we'll introduce agent evaluation in the context of A2A (Agent2Agent) interactions through a simple, practical example.
In our forthcoming posts, we will delve deeper into this topic, focusing on the mechanics of handoffs and the important distinction between agents and tools within a trajectory.
This post demonstrates how to evaluate an A2A Reimbursement Agent running in-memory using the Vertex AI Gen AI Evaluation service.
Prerequisites:
- Colab authentication: run `from google.colab import auth` and `auth.authenticate_user()`.
- An A2A Agent Executor (the `ReimbursementAgentExecutor` class) must be defined or importable within this notebook. This executor should have a method like `async def execute(self, message_payload: a2a.types.MessagePayload) -> a2a.types.Message:`.
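In a Colab environment, the authentication step is just:

```python
# Authenticate the Colab runtime so it can call Google Cloud services.
from google.colab import auth

auth.authenticate_user()
```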
This post contains two sections: A. Agent setup and B. Evaluation.
A. Agent setup
1. Setup and Installs
We will be leveraging the A2A (Agent2Agent) and ADK (Agent Development Kit) Python SDKs in this tutorial. To learn more about the Agent2Agent Protocol, you can review the official A2A documentation.
!pip install google-cloud-aiplatform httpx "a2a-sdk==0.2.5" --quiet
!pip install --upgrade --quiet 'google-adk==1.2.0'
Follow the tutorial for imports and configurations.
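For reference, the configuration typically looks like the snippet below. The project ID, region, bucket, and experiment name are placeholders you would replace with your own values.

```python
import vertexai

# Placeholder values; replace with your own project, region, and bucket.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
BUCKET_URI = "gs://your-bucket-name"
EXPERIMENT_NAME = "evaluate-a2a-agent"

vertexai.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)
```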
2. Defining the Reimbursement Agent
This section defines an AI-powered Reimbursement Agent using Google's Agent Development Kit (ADK). Its purpose is to automate the process of handling employee reimbursement requests through a conversational interface.
The agent has access to the following tools: `create_request_form`, `return_form`, and `reimburse`.
The `ReimbursementAgent` class orchestrates the entire process.
**`__init__(self)`**: The constructor initializes the agent. It builds the agent's logic by calling `_build_agent` and sets up a `Runner`. The runner is the engine that executes the agent's tasks, using in-memory services for sessions and artifacts, which means it doesn't need an external database for this example.
**`_build_agent(self)`**: This is the core of the agent's definition.
- Model: It specifies `gemini-2.0-flash-001` as the LLM.
- Instructions: It provides a detailed, multi-step prompt that tells the LLM exactly how to behave.
**`stream(self, query, session_id)`**: This asynchronous method is the entry point for handling user input. It manages the conversation session and streams responses back. As the agent processes a request, it can `yield` intermediate updates (e.g., "Processing the reimbursement request...") before sending the final, structured response. A condensed sketch of the class follows.
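The full class lives in the accompanying notebook; below is a condensed, illustrative sketch of the same pattern against ADK 1.2.0. The tool bodies, the instruction text, and the `stream` payload shapes are simplified stand-ins, not the notebook's exact code.

```python
from google.adk.agents import LlmAgent
from google.adk.artifacts import InMemoryArtifactService
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types


# Illustrative tool stubs; the real tools carry the form-handling logic.
def create_request_form(date: str, amount: str, purpose: str) -> dict:
    """Creates a new reimbursement request form."""
    return {"request_id": "req-1", "date": date, "amount": amount, "purpose": purpose}


def return_form(form: dict) -> dict:
    """Returns the (possibly incomplete) form to the user to fill in."""
    return form


def reimburse(request_id: str) -> dict:
    """Approves the reimbursement request."""
    return {"request_id": request_id, "status": "approved"}


class ReimbursementAgent:
    SUPPORTED_CONTENT_TYPES = ["text", "text/plain"]

    def __init__(self):
        self._agent = self._build_agent()
        self._user_id = "remote_agent"
        # In-memory services: no external database is needed for this example.
        self._runner = Runner(
            app_name="reimbursement_agent",
            agent=self._agent,
            session_service=InMemorySessionService(),
            artifact_service=InMemoryArtifactService(),
        )

    def _build_agent(self) -> LlmAgent:
        return LlmAgent(
            model="gemini-2.0-flash-001",
            name="reimbursement_agent",
            instruction=(
                "Handle employee reimbursement requests: create a request form, "
                "return it to the user if details are missing, then reimburse."
            ),  # abbreviated; the notebook uses a detailed multi-step prompt
            tools=[create_request_form, return_form, reimburse],
        )

    async def stream(self, query: str, session_id: str):
        # Reuse the session if it exists; otherwise create it.
        session = await self._runner.session_service.get_session(
            app_name=self._runner.app_name, user_id=self._user_id, session_id=session_id
        ) or await self._runner.session_service.create_session(
            app_name=self._runner.app_name, user_id=self._user_id, session_id=session_id
        )
        content = types.Content(role="user", parts=[types.Part(text=query)])
        async for event in self._runner.run_async(
            user_id=self._user_id, session_id=session.id, new_message=content
        ):
            if event.is_final_response():
                yield {"is_task_complete": True, "content": event.content.parts[0].text}
            else:
                yield {"is_task_complete": False,
                       "updates": "Processing the reimbursement request..."}
```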
3. Defining the Agent Executor
The code defines an Agent Executor, a component that acts as a controller for the `ReimbursementAgent` you saw previously.
The `ReimbursementAgent` is the specialized worker that knows the steps to handle a reimbursement (create form, check details, approve). The `ReimbursementAgentExecutor` is the supervisor that gives the agent the user's request, manages the task's lifecycle, and translates the agent's work into status updates for the larger system.
The executor's main job is to run the agent and manage the communication flow.
**`__init__(self)`**: When the executor is created, it immediately creates an instance of the `ReimbursementAgent`, holding it internally.
**`execute(...)`**: This is the primary method and contains the main operational logic.
**`cancel(...)`**: This method is explicitly not implemented. If a cancellation is requested, it raises an `UnsupportedOperationError`, making it clear that this feature is not available in this executor.
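Schematically, and assuming the `a2a-sdk` 0.2.5 interfaces installed above, the executor looks roughly like this; it is a sketch, not the notebook's exact code, and the event-translation logic inside `execute` is elided.

```python
from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.events import EventQueue
from a2a.types import UnsupportedOperationError
from a2a.utils.errors import ServerError


class ReimbursementAgentExecutor(AgentExecutor):
    def __init__(self):
        # Create and hold the specialized worker internally.
        self.agent = ReimbursementAgent()

    async def execute(self, context: RequestContext, event_queue: EventQueue) -> None:
        query = context.get_user_input()
        # Stream the agent's work and translate each update into task
        # status/artifact events on the event queue (details omitted here).
        async for item in self.agent.stream(query, context.context_id):
            ...

    async def cancel(self, context: RequestContext, event_queue: EventQueue) -> None:
        # Cancellation is intentionally unsupported in this executor.
        raise ServerError(error=UnsupportedOperationError())
```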
4. Defining the A2A server
Next, you need to define an A2A (Agent2Agent) server to expose the `ReimbursementAgent` and its `Executor` to the outside world. The server acts as a web endpoint, allowing other applications, like a chat app or a web browser, to communicate with your agent over a network.
from a2a.server.apps import A2AStarletteApplication
from a2a.server.request_handlers import DefaultRequestHandler
from a2a.server.tasks import InMemoryTaskStore
from a2a.types import AgentCapabilities, AgentCard, AgentSkill

capabilities = AgentCapabilities(streaming=True)
skill = AgentSkill(
    id='process_reimbursement',
    name='Process Reimbursement Tool',
    description='Helps with the reimbursement process for users given the amount and purpose of the reimbursement.',
    tags=['reimbursement'],
    examples=[
        'Can you reimburse me $20 for my lunch with the clients?'
    ],
)
agent_card = AgentCard(
    name='Reimbursement Agent',
    description='This agent handles the reimbursement process for the employees given the amount and purpose of the reimbursement.',
    url='http://localhost/agent',  # Placeholder, not used by TestClient
    version='1.0.0',
    defaultInputModes=ReimbursementAgent.SUPPORTED_CONTENT_TYPES,
    defaultOutputModes=ReimbursementAgent.SUPPORTED_CONTENT_TYPES,
    capabilities=capabilities,
    skills=[skill],
)
request_handler = DefaultRequestHandler(
    agent_executor=ReimbursementAgentExecutor(),
    task_store=InMemoryTaskStore(),
)
server = A2AStarletteApplication(
    agent_card=agent_card, http_handler=request_handler
)

# Build the Starlette ASGI app.
# This `starlette_app` can be served by Uvicorn or used with TestClient.
expense_starlette_app = server.build()
5. Test Client
This code uses a `TestClient` to simulate a client application (like a chat app) talking to your agent server. It runs two main tests to ensure the server is working correctly: (1) getting the agent card, and (2) simulating a user sending a message to the agent to start a task.
import json
import logging

from starlette.testclient import TestClient

# Basic logging setup (helpful for seeing what the handler does)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# TestClient should be used as a context manager or closed explicitly
with TestClient(expense_starlette_app) as client:
    logger.info("\n--- Test 1: Get Agent Card ---")
    response = client.get("/.well-known/agent.json")
    assert response.status_code == 200
    agent_card_data = response.json()
    print(f"SUCCESS: Agent Card received: {agent_card_data['name']}")
    print("A2AClient initialized.")

    print("\n--- Test 2: Non-streaming RPC - message/send ---")
    message_id_send = "colab-msg-007"
    rpc_request_send_msg = {
        "jsonrpc": "2.0",
        "id": "colab-req-send-msg-1",
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": "Hello Agent, Please reimburse me $20 for my lunch with the clients on 06/01/2025?"}],
                "messageId": message_id_send,
                "kind": "message",
                "contextId": "colab-session-xyz",
            }
        },
    }
    response = client.post("/", json=rpc_request_send_msg)
    assert response.status_code == 200
    rpc_response_send_msg = response.json()
    print(f"message/send response: {json.dumps(rpc_response_send_msg, indent=2)}")
    print(f"SUCCESS: message/send for '{message_id_send}' passed.")
B. Evaluation
1. Imports and helper functions
Import packages and define helper functions as shown in the Colab notebook.
**`parse_a2a_output_to_dictionary()`** and **`parse_adk_output_to_dictionary()`** take the raw JSON or event output and extract the most important information: the agent's final text response and the sequence of tools it called (its "trajectory"). This ties back to our earlier discussion, and to our previous post, about evaluating trajectories. A simplified sketch of the A2A parser follows.
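As a rough, hypothetical sketch (the notebook's version handles more response shapes), the A2A parser pulls the text parts out of the `message/send` result and returns the column names the evaluation service expects:

```python
def parse_a2a_output_to_dictionary(rpc_response: dict) -> dict:
    """Extracts the final response text from a message/send JSON-RPC response."""
    result = rpc_response.get("result", {})
    # A completed task carries its final message under status.message;
    # a plain message result carries the parts directly.
    message = result.get("status", {}).get("message", result)
    final_text = " ".join(
        part.get("text", "")
        for part in message.get("parts", [])
        if part.get("kind") == "text"
    )
    return {
        "response": final_text,
        # The tool-call trajectory is not recovered in this simplified sketch.
        "predicted_trajectory": [],
    }
```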
2. Assembling the agents
def a2a_parsed_outcome(query):
    # TestClient should be used as a context manager or closed explicitly
    with TestClient(expense_starlette_app) as client:
        print("\n--- Get Agent Card ---")
        response = client.get("/.well-known/agent.json")
        assert response.status_code == 200
        agent_card_data = response.json()
        print(f"--- SUCCESS: Agent Card received: {agent_card_data['name']} ---")
        print("--- A2AClient initialized. ---")
        print(f"Query: {query}")

        message_id_send = f"colab-msg-{get_id()}"
        rpc_request_send_msg = {
            "jsonrpc": "2.0",
            "id": f"colab-req-send-msg-{get_id()}",
            "method": "message/send",
            "params": {
                "message": {
                    "role": "user",
                    "parts": [{"kind": "text", "text": query}],
                    "messageId": message_id_send,
                    "kind": "message",
                    "contextId": "colab-session-xyz",
                }
            },
        }
        response = client.post("/", json=rpc_request_send_msg)
        assert response.status_code == 200
        rpc_response_send_msg = response.json()
        print(f"SUCCESS: message/send for '{message_id_send}' Finished")
        return parse_a2a_output_to_dictionary(rpc_response_send_msg)
With this setup, let's query the agent with some quick examples.
response = a2a_parsed_outcome(query="Get product details for shoes")
display(Markdown(format_output_as_markdown(response)))
response = a2a_parsed_outcome(query="Hello Agent, Please reimburse me $20 for my lunch with the clients on 06/01/2025?")
display(Markdown(format_output_as_markdown(response)))
response = a2a_parsed_outcome(query="Hello Agent, Please reimburse me $311 for my flights from SFO to SEA on 06/11/2025?")
display(Markdown(format_output_as_markdown(response)))
response = a2a_parsed_outcome(query="Hello Agent, Please reimburse me $50 for my lunch with the clients on Jan 2nd,2024?")
display(Markdown(format_output_as_markdown(response)))
3. Prepare agent evaluation dataset
To evaluate your AI agent using the Vertex AI Gen AI Evaluation service, you need a specific dataset depending on which aspects of your agent you want to evaluate.
This dataset should include the prompts given to the agent. It can also contain the ideal or expected response (ground truth) and the intended sequence of tool calls the agent should take (reference trajectory), i.e., the sequence of tools you expect the agent to call for each prompt.
Optionally, you can provide both generated responses and predicted trajectories (the Bring-Your-Own-Dataset scenario).
Below is an example dataset for the Reimbursement Agent with user prompts and (empty) reference trajectories.
#@title Define eval datasets
# The reference trajectories are empty in this example.
eval_data_a2a = {
    "prompt": [
        "Get product details for shoes",
        "Hello Agent, Please reimburse me $20 for my lunch with the clients on 06/01/2025?",
        "Hello Agent, Please reimburse me $20 for my lunch with the clients",
        "Please reimburse me $312 for my meal with the clients on 06/05/2025?",
        "Please reimburse me $1234 for my flight to Seattle on 06/11/2025?",
    ],
    "reference_trajectory": [
        [], [], [], [], [],
    ],
}
eval_sample_dataset = pd.DataFrame(eval_data_a2a)
display_dataframe_rows(eval_sample_dataset, num_rows=30)
4. Evaluate final response
# EvalTask may already be imported in the setup step; the preview path is
# shown here, and it can differ by google-cloud-aiplatform version.
from vertexai.preview.evaluation import EvalTask

# EXPERIMENT_NAME, BUCKET_URI, and response_metrics come from the notebook's
# earlier configuration steps, e.g. response_metrics = ["coherence", "safety"].
EXPERIMENT_RUN = f"response-{get_id()}"

response_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=response_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/response-metric-eval",
)

response_eval_result = response_eval_task.evaluate(
    runnable=a2a_parsed_outcome, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(response_eval_result)
Sample Output
Feel free to check out the A2A project for more information, and do read our thought leadership on Why Agents are not Tools.