Building and scaling your own AI agent can seem complex, but it doesn't have to be! Hugging Face's smolagents library provides a lightweight framework for building powerful agents. Vertex AI addresses deployment and management complexities for AI agents through Reasoning Engine, a managed service for testing, deploying, and monitoring AI agent reasoning frameworks, including LangChain, LangGraph, or any custom framework. This blog post walks you through the entire process, from defining your smolagents agent and its tools to deploying and scaling it on Vertex AI.
With Hugging Face's smolagents library, you can create a simple agent that leverages Gemini's function calling for orchestration and answer generation with any tools you define. In this example, you'll build a simple agent that can answer math questions. The agent uses a math checker tool to validate the answer before showing it to the user.
To start, you define a VertexAIServerModel class by subclassing the smolagents Model class. This class represents the Gemini text generation model, which serves as the engine for your agent. Because the smolagents library does not support Vertex AI out of the box, you'll need to write some extra code to call the Gemini API in Vertex AI. Check out the complete sample notebook to see the full code implementation.
class VertexAIServerModel(Model):
    """This model connects to a Vertex AI-compatible API server."""
    ...

    def __call__(
        self,
        messages: list[dict[str, str]],
        **kwargs,
    ) -> ChatMessage:
        # Prepare the API call parameters
        completion_kwargs = self._prepare_completion_kwargs(
            messages=messages,
            model=self.model_id,
            **self.kwargs,
        )

        # Make the API call to Vertex AI
        response = self.client.chat.completions.create(**completion_kwargs)
        self.last_input_token_count = response.usage.prompt_tokens
        self.last_output_token_count = response.usage.completion_tokens

        # Convert API response to ChatMessage format
        message = ChatMessage.from_dict(
            response.choices[0].message.model_dump(
                include={"role", "content", "tool_calls"}
            )
        )
        return message
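The constructor is elided above; the sample notebook has the full implementation. As a rough sketch of what it needs to do, the class can point an OpenAI-compatible client at Vertex AI's Chat Completions endpoint. The helper name build_vertex_openai_client, the v1beta1 base URL format, and the use of Application Default Credentials below are illustrative assumptions, not the notebook's exact code:

# Sketch only: how an OpenAI-compatible client for Gemini on Vertex AI could be
# built inside the omitted __init__ (see the sample notebook for the exact code).
import google.auth
import google.auth.transport.requests
from openai import OpenAI

def build_vertex_openai_client(project_id: str, location: str, endpoint_id: str = "openapi") -> OpenAI:
    # Assumes Application Default Credentials are configured for your project.
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    credentials.refresh(google.auth.transport.requests.Request())
    # Vertex AI exposes an OpenAI-compatible Chat Completions endpoint; "openapi"
    # is the shared entry point used for Gemini models in this example.
    return OpenAI(
        base_url=(
            f"https://{location}-aiplatform.googleapis.com/v1beta1/"
            f"projects/{project_id}/locations/{location}/endpoints/{endpoint_id}"
        ),
        api_key=credentials.token,  # short-lived OAuth token, not a static key
    )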
Next, define the tool, which is a self-contained function the agent can utilize. Because this tool uses a distilled DeepSeek-R1 model deployed on Vertex AI, you subclass the smolagents Tool class to call the model and verify math results.
class DeepSeekMathVerifierTool(Tool):
    """A tool that verifies math responses"""

    name = "math_verifier"
    description = """This is a tool that verifies math responses"""
    inputs = {
        "content": {
            "type": "string",
            "description": "a text containing math",
        }
    }
    output_type = "string"
    ...

    def forward(self, content: str):
        """Submit the prediction request"""
        content = str(content)
        prediction_request = {
            "instances": [
                {
                    "@requestFormat": "chatCompletions",
                    "messages": [{"role": "user", "content": content}],
                }
            ]
        }
        try:
            output = self.endpoint.predict(instances=prediction_request["instances"])
        except Exception as e:
            print(f"Prediction failed: {e}")
            return None

        prediction = output.predictions[0][0]["message"]["content"]
        return prediction
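As with the model class, the tool's constructor is elided above. One plausible sketch, assuming the tool holds a google-cloud-aiplatform Endpoint object that forward() calls, looks like the following (the exact constructor is in the sample notebook):

# Sketch only: a possible __init__ that attaches the tool to the deployed
# DeepSeek endpoint via the Vertex AI SDK.
from google.cloud import aiplatform
from smolagents import Tool

class DeepSeekMathVerifierTool(Tool):
    # name, description, inputs, output_type, and forward() as shown above

    def __init__(self, endpoint_id: str, project_id: str, location: str):
        super().__init__()
        # Initialize the Vertex AI SDK for your project and region.
        aiplatform.init(project=project_id, location=location)
        # Reference the deployed DeepSeek endpoint; forward() calls its predict() method.
        self.endpoint = aiplatform.Endpoint(endpoint_name=endpoint_id)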
Now that the model and tool have been defined, it's time to assemble the agent and test it locally. First, you'll create an instance of the Vertex AI model class, passing in the model ID (google/gemini-1.5-flash in this case), as well as your Google Cloud project ID and location.
# Create model
model = VertexAIServerModel(
    model_id='google/gemini-1.5-flash',
    endpoint_id='openapi',
    project_id='your-project',
    location='your-location',
)
Next, initialize the agent's tools. When you create an instance of the DeepSeek-R1 math verifier tool, you'll need to pass in the endpoint ID. This is the unique identifier for the endpoint created in the previous section, when you deployed the DeepSeek model to Vertex AI. You can find the endpoint ID in the Vertex AI Model Registry, or look it up programmatically, as sketched after the following code.
# Define tools
tools = [
    DeepSeekMathVerifierTool(
        endpoint_id=endpoint_id,
        project_id='your-project',
        location='your-location',
    )
]
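If you'd rather not dig through the console, a short optional sketch like the following lists the endpoints in your project so you can spot the DeepSeek one; the endpoint ID is the trailing number of the resource name:

# Optional: list the Vertex AI endpoints in your project to find the DeepSeek
# endpoint ID programmatically (the ID is the last segment of resource_name).
from google.cloud import aiplatform

aiplatform.init(project='your-project', location='your-location')
for endpoint in aiplatform.Endpoint.list():
    print(endpoint.display_name, endpoint.resource_name)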
Smolagents provides first-class support for Code Agents, which are agents that write their actions in code. In the final step, you'll assemble your CodeAgent and pass in the Gemini model and the DeepSeek math tool.
# Assemble agent
agent = CodeAgent(model=model, tools=tools, add_base_tools=False)
Now, you’re ready to test out the agent! Let’s try asking it a question.
response = agent.run("Count the number of 'r' in the word Strawberry. Verify the answer")
In the agent's response, you can see that it called the math_verifier tool we defined earlier, which is powered by the distilled DeepSeek-R1 model, while Gemini on Vertex AI acts as the agent that returns the final answer. It seems we have found the solution! The word "Strawberry" contains three 'r's.
After prototyping the functionality of your Gemini and DeepSeek agent in your local development environment, the next step is to move to production deployment and scaling. This often means dealing with infrastructure, containerization, scaling strategies, and monitoring.
Vertex AI addresses these deployment and management complexities for AI agents through the use of Reasoning Engine, which is a managed service for testing, deploying, and monitoring AI agent reasoning frameworks.
Reasoning Engine simplifies agent deployment and scaling, so that you can focus on the agent's logic, quality, and capabilities. You define your agent's logic – in our case, an agent class that incorporates Gemini and DeepSeek – and Reasoning Engine handles the underlying infrastructure and agent runtime.
Define your agent
To effectively use Vertex AI Reasoning Engine, defining your agent's logic is the essential first step. Reasoning Engine provides a flexible framework for this: it integrates seamlessly with the Vertex AI Python SDK and offers compatibility with popular agentic frameworks like LangChain and LlamaIndex, as well as custom frameworks.
You can deploy this Gemini and DeepSeek agent to Reasoning Engine using a custom application template. This template acts as a blueprint for your agent deployment and is highly adaptable. The two main components for building a custom agent within Reasoning Engine are the .set_up() and .query() methods:
class SmolAgent:
    """Simplified SmolAgent class for Reasoning Engine deployment."""

    def __init__(...):
        """Initializes SmolAgent."""
        self.model_id = model_id
        # ...

    def set_up(self) -> None:
        """Sets up the agent components (models, tools, agent)."""
        self.model = VertexAIServerModel(...)
        self.tools = [DeepSeekMathVerifierTool(...)]
        self.app = CodeAgent(...)

    def query(self, input: str):
        """Queries the agent and returns the response."""
        return self.app.run(input)
The .set_up() method initializes the agent's core components: a VertexAIServerModel that connects to Gemini on Vertex AI, a DeepSeekMathVerifierTool that calls the deployed DeepSeek model for mathematical verification, and a CodeAgent that writes its actions in code to orchestrate them.
The .query() method provides a simple interface for sending input to the agent and receiving its response, effectively triggering the agent's execution.
To learn more about working with custom agents and frameworks in Reasoning Engine, check out the documentation on how to customize an application template.
Test your agent locally
Now that you've defined your agent's logic within the SmolAgent class, it's time to put it to the test. Local testing lets you simulate real-world interactions with your agent in a controlled setting before proceeding with deployment to Reasoning Engine.
You'll instantiate your SmolAgent class and initialize it using the .set_up() method:
local_agent = SmolAgent(
    model_id="google/gemini-1.5-flash",
    endpoint_id="openapi",
    tool_endpoint_id=endpoint_id,
    project_id=PROJECT_ID,
    location=LOCATION,
)
local_agent.set_up()
With the agent instantiated and set up, you can query it locally to confirm its expected behavior:
local_agent.query(input="Count the number of 'r' in the word Strawberry. Verify the answer")
The final answer in the response from the local agent should appear similar to:
Answer: The word "Strawberry" has three 'r's because it combines "straw," which has one 'r,' and "berry," which has two 'r's.
Nice work! Now you’ve verified that your Gemini and DeepSeek agent is running smoothly locally.
Deploy your agent to Reasoning Engine
Now that you've defined and locally tested your agent, the next step is deploying it to Vertex AI Reasoning Engine. This important step transitions your agent from a local prototype to a remotely accessible service. By deploying to Reasoning Engine, you'll be able to integrate your agent with external systems and query it as a standalone, scalable endpoint.
To deploy, you'll use the .create() method in the Reasoning Engine SDK, provide an instance of your agent class, and specify any external dependencies or runtime requirements. Reasoning Engine then handles packaging, containerization, and deployment to a scalable endpoint:
remote_agent = reasoning_engines.ReasoningEngine.create(
    local_agent,
    requirements=[
        "google-cloud-aiplatform[reasoningengine]",
        "smolagents",
    ],
)
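Once the deployment finishes, the returned object carries the resource name of the new Reasoning Engine instance, which is worth recording so other applications can reference the agent later (a minimal sketch, assuming the returned object exposes resource_name):

# Record the deployment's resource name for use from other applications,
# e.g. projects/<project-number>/locations/<location>/reasoningEngines/<engine-id>
print(remote_agent.resource_name)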
With your agent successfully deployed to Vertex AI Reasoning Engine, you've completed a significant milestone in bringing your AI agent to life. The deployment process, handled seamlessly by Reasoning Engine, sets the stage for the next crucial step: interacting with your deployed agent and leveraging its scalable capabilities through remote queries.
Query your deployed agent
After deploying your agent to Reasoning Engine, you've reached the point where you can directly interact with it in its managed environment and verify its functionality.
You can then send queries to this endpoint and receive responses, just as you did in your local testing. Reasoning Engine handles request routing and load balancing, and ensures reliable communication with your deployed agent:
remote_agent.query(
    input="Count the number of 'r' in the word Strawberry. Verify the answer"
)
The final answer in the response from the remote agent should appear similar to:
Answer: The word "Strawberry" has three 'r's because it combines "straw," which has one 'r,' and "berry," which has two 'r's.
Great! Our remotely deployed agent is working as expected.
At this point, you've deployed your Gemini and DeepSeek agent to Reasoning Engine and can interact with it from other applications or environments. The primary ways to access your deployed agent are the Vertex AI Python client library and REST API calls from the tools and languages of your choosing.
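For example, with the Python client library you could reconnect to the deployed agent from another environment by its Reasoning Engine resource name. A minimal sketch follows; the resource name below is a placeholder, so substitute the one from your own deployment:

# Reconnect to the deployed agent by resource name and query it.
import vertexai
from vertexai.preview import reasoning_engines

vertexai.init(project='your-project', location='your-location')

agent = reasoning_engines.ReasoningEngine(
    'projects/your-project/locations/your-location/reasoningEngines/REASONING_ENGINE_ID'
)
response = agent.query(
    input="Count the number of 'r' in the word Strawberry. Verify the answer"
)
print(response)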
That's a quick look at how to build and scale a Hugging Face smolagents agent on Vertex AI using Reasoning Engine. As a next step, you can explore other Vertex AI features, such as the Gen AI Evaluation service, to perform quality tests and help move your agent prototype into production.
If you want to get hands-on with the code, check out the resources below, where you'll find documentation, a tutorial notebook, and code samples:
Let’s connect to share feedback and questions about Vertex AI!