The evolving ecosystem of open-source Large Language Models (LLMs) is enabling developers to build transformative use cases. From advanced text generation to complex reasoning, these models offer a strong foundation for building AI applications. But how do you move these open models from experimentation to production efficiently?
Deploying open LLMs at scale presents significant challenges. Developers often struggle to achieve the low latency and high throughput required for production workloads, especially with larger models. Manually optimizing models for specific hardware is a complex, time-consuming process, and managing the infrastructure for cost-effective serving adds yet another layer of difficulty.
To address these challenges, in partnership with NVIDIA, we're thrilled to announce the Vertex AI prebuilt NVIDIA TensorRT-LLM container, optimized for DeepSeek-V3, DeepSeek-R1, and Llama 3.3 70B Instruct with FP8 precision on NVIDIA Hopper (H200) GPUs in Vertex AI Model Garden. This powerful integration brings NVIDIA's open-source library for optimizing Large Language Model (LLM) inference directly into Vertex AI Prediction, enabling you to serve open models with one-click deployment and get significantly improved performance and cost-efficiency.
With NVIDIA’s TensorRT-LLM library, customers can now achieve up to ~45% higher throughput (output tokens/sec) and ~40% lower latency (time-to-first-token) on average for DeepSeek-R1 and Llama 3.3 70B on Google Cloud Vertex AI.
Deploying your own DeepSeek-R1 model instance on Vertex AI with NVIDIA TensorRT-LLM is a streamlined process. Here’s how you can get started through the Google Cloud console. Begin by heading to the Vertex AI Model Garden, your central hub for discovering, browsing, and deploying models. Here you'll find models from Google and other providers, and you can search for specific models like DeepSeek-R1.
Once you choose DeepSeek-R1, Vertex AI Model Garden provides an overview of the model, its use cases, documentation, and pricing information.
With DeepSeek-R1 selected, you can proceed with a one-click deployment to a Vertex AI endpoint. This option uses recommended settings to simplify the deployment process, though custom deployment options are also available for more advanced use cases. During deployment, you can select the TensorRT-LLM container option, which ensures your model is served with the advanced performance optimizations provided by NVIDIA's library, including speculative decoding, optimized kernels, and more. You will also see your available quota for the necessary hardware, such as NVIDIA H200 GPUs.
Once you start the deployment, the system notifies you as the model is uploaded and deployed to an endpoint.
Once deployed, Vertex AI manages the model on an endpoint that can be accessed for online predictions. You can view and manage your model in the Vertex AI Model Registry.
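If you prefer to work programmatically, you can also locate the deployed resources with the Vertex AI SDK. The snippet below is a minimal sketch, assuming the google-cloud-aiplatform SDK and the display names used in the deployment example later in this post; adjust the project, region, and filters to match your own deployment.
from google.cloud import aiplatform

# Initialize the SDK with the project and region used for deployment.
aiplatform.init(project="your-project", location="your-location")

# Find the endpoint created by the one-click deployment.
for endpoint in aiplatform.Endpoint.list(
    filter='display_name="deepseek-ai_deepseek-r1-mg-one-click-deploy"'
):
    print("Endpoint:", endpoint.resource_name)

# The uploaded model is also visible in the Vertex AI Model Registry.
for model in aiplatform.Model.list(
    filter='display_name="deepseek-ai_deepseek-r1-1750971292426"'
):
    print("Model:", model.resource_name)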
This entire deployment process can also be handled programmatically using the Vertex AI Model Garden SDK, as shown below.
import vertexai
from vertexai import model_garden

# Initialize the Vertex AI SDK with your project and region.
vertexai.init(project="your-project", location="your-location")

# Reference the DeepSeek-R1 open model in Model Garden.
model = model_garden.OpenModel("deepseek-ai/deepseek-r1@deepseek-r1")

# Deploy to a dedicated endpoint on an a3-ultragpu-8g machine with 8x H200 GPUs,
# using the prebuilt TensorRT-LLM serving container.
endpoint = model.deploy(
    accept_eula=True,
    machine_type="a3-ultragpu-8g",
    accelerator_type="NVIDIA_H200_141GB",
    accelerator_count=8,
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/tensorrt-llm.cu128.0-18.ubuntu2404.py312:deepseek",
    endpoint_display_name="deepseek-ai_deepseek-r1-mg-one-click-deploy",
    model_display_name="deepseek-ai_deepseek-r1-1750971292426",
    use_dedicated_endpoint=True,
)
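Before deploying, you may also want to check which machine and accelerator configurations are verified for the model. The snippet below is a short sketch that assumes a recent SDK version where OpenModel exposes the list_deploy_options() helper; it prints the supported serving containers and hardware shapes for DeepSeek-R1.
import vertexai
from vertexai import model_garden

vertexai.init(project="your-project", location="your-location")

# Inspect the verified deployment configurations for DeepSeek-R1.
open_model = model_garden.OpenModel("deepseek-ai/deepseek-r1@deepseek-r1")
for option in open_model.list_deploy_options():
    print(option)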
Finally, you can send prediction requests directly through the Cloud console or the Vertex AI API. Here is an example of how to generate text by providing a prompt.
import json

from google.cloud import aiplatform

# Reference the dedicated endpoint created during deployment.
deepseek_endpoint = aiplatform.Endpoint("your-endpoint-resource-name")

user_message = "How many r's are in strawberry?"
max_tokens = 50
temperature = 1.0

# Send an OpenAI-style chat completion request to the TensorRT-LLM container.
response = deepseek_endpoint.raw_predict(
    body=json.dumps(
        {
            "model": "",
            "messages": [
                {
                    "role": "user",
                    "content": user_message,
                }
            ],
            "max_tokens": max_tokens,
            "temperature": temperature,
        }
    ),
    headers={"Content-Type": "application/json"},
    use_dedicated_endpoint=True,
)

print(response.json()["choices"][0]["message"]["content"])
# First, the question is: "How many r's are in strawberry?"
# I need to count the number of times the letter 'r' appears in the word "strawberry"...
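For repeated requests, it can be convenient to wrap the call above in a small helper. The sketch below is purely illustrative: the hypothetical chat() function simply reuses the raw_predict pattern shown earlier, and the endpoint resource name, prompt, and generation parameters are placeholders.
import json

from google.cloud import aiplatform

def chat(endpoint: aiplatform.Endpoint, prompt: str, max_tokens: int = 256, temperature: float = 1.0) -> str:
    """Send a single-turn chat request to the dedicated TensorRT-LLM endpoint."""
    response = endpoint.raw_predict(
        body=json.dumps(
            {
                "model": "",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "temperature": temperature,
            }
        ),
        headers={"Content-Type": "application/json"},
        use_dedicated_endpoint=True,
    )
    return response.json()["choices"][0]["message"]["content"]

print(chat(aiplatform.Endpoint("your-endpoint-resource-name"), "Summarize TensorRT-LLM in one sentence."))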
Based on internal tests, NVIDIA’s TensorRT-LLM library on Vertex AI delivers up to ~45% higher throughput (output tokens/sec) compared with other popular open-source inference options (baseline), as shown below.
It also delivers ~40% lower latency (time-to-first-token) on average for DeepSeek-R1 and Llama 3.3 70B.
The new Vertex AI prebuilt container with NVIDIA TensorRT-LLM provides a powerful yet easy-to-use solution for deploying open LLMs with improved performance and cost-efficiency. By integrating the advanced optimization capabilities of NVIDIA TensorRT-LLM directly into the Vertex AI platform, we enable developers to deploy faster and build more scalable AI applications.
Want to learn more and get started with the Vertex AI prebuilt container with NVIDIA TensorRT-LLM?
Check out the following resources: