Introducing Llama 4 on Vertex AI

TL;DR: Llama 4, the first family of multimodal Llama models built on a Mixture-of-Experts (MoE) architecture, is now available on Vertex AI! You can deploy Llama 4 Scout (with a context window of up to 10M tokens) and Llama 4 Maverick on Vertex AI with three lines of code using the Vertex AI Model Garden SDK.

Deploying Llama 4 in the Vertex AI console

Today, we're excited to announce that Llama 4, the latest generation of open models from Meta, is available for you to use on Vertex AI! This is a significant leap forward, especially for those of you looking to build more sophisticated and personalized multimodal applications.

Llama 4 marks the family's first multimodal models powered by a Mixture-of-Experts (MoE) architecture. What does this mean for you? MoE allows models to be very large in total parameters while only activating a subset ("experts") for any given input token, leading to more efficient training and inference. Furthermore, Llama 4 utilizes early fusion, a technique that integrates text and vision information right from the initial processing stages within a unified model backbone. This joint pre-training with text and image data allows the models to grasp complex, nuanced relationships between modalities more effectively than ever before.  
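
To make the routing idea concrete, here is a toy, self-contained sketch of top-1 expert routing in plain NumPy. This is not Meta's implementation; the dimensions, gating scheme, and weights are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS = 64, 16  # hidden size and expert count (illustrative values)
experts = [rng.standard_normal((D, D)) * 0.02 for _ in range(N_EXPERTS)]
router = rng.standard_normal((D, N_EXPERTS)) * 0.02  # gating weights

def moe_layer(token):
    """Route one token through a single expert (top-1 gating)."""
    logits = token @ router                          # score each expert
    k = int(np.argmax(logits))                       # pick the best-scoring expert
    gate = np.exp(logits[k]) / np.exp(logits).sum()  # softmax gate weight
    # Only expert k's weights are touched here, which is why an MoE model can
    # have a large total parameter count while the active parameter count per
    # token stays small.
    return gate * (token @ experts[k])

print(moe_layer(rng.standard_normal(D)).shape)  # (64,)

In Llama 4 Maverick's MoE layers, each token additionally passes through a shared expert whose output is combined with the routed expert's, as described below.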

Llama 4 comes in two released flavors, Scout and Maverick, giving you options based on your performance needs and resource constraints. There is also a larger "teacher" model, Behemoth, currently in training.

Llama 4 Scout

  • Architecture: MoE with 17 billion active parameters, 16 experts, and 109 billion total parameters.  
  • Availability: Both pretrained (PT) and instruction-tuned (IT) models are available.
  • Performance: Delivers state-of-the-art results for its size class, outperforming previous Llama generations and other open and proprietary models on various benchmarks.
  • Key Feature: A massive, industry-leading 10 million token context window (up significantly from Llama 3's 128k).  
  • Best for: Retrieval tasks within long contexts and tasks demanding reasoning over vast amounts of information, such as summarizing multiple large documents, analyzing extensive user interaction logs for personalization, and reasoning across large codebases.

Llama 4 Maverick

  • Architecture: MoE with 17 billion active parameters, 128 experts, and 400 billion total parameters. Uses alternating dense and MoE layers, where each token activates a shared expert plus one of the 128 routed experts.  
  • Availability: Both pretrained (PT) and instruction-tuned (IT, with FP8 support) models are available.
  • Performance: The largest and most capable Llama 4 model released so far, offering industry-leading capabilities on coding, reasoning, and image benchmarks, while being competitive with DeepSeek v3.1. Provides a best-in-class performance-to-cost ratio.  
  • Key Features: Native multimodality with a 1M context length. Optimized for high-quality chat interactions through a refined post-training pipeline (lightweight SFT > online RL > lightweight DPO). Pre-trained on 200 languages for broad fine-tuning potential.  
  • Best for: Advanced image captioning and analysis, precise image understanding and visual Q&A, creative text generation, and general-purpose AI assistants and sophisticated chatbots that require top-tier intelligence and image understanding.

To help developers create safe and useful Llama-supported applications and reduce the risk of adversarial failures, both models incorporate tunable, system-level, multi-layered mitigations at each stage of development, from pre-training to post-training.

Get Started with Llama 4 on Vertex AI Model Garden

The easiest way to get Llama 4 up and running is through the Vertex AI Model Garden.

We've streamlined the deployment process – you can deploy an optimized Llama 4 endpoint with just a few lines of code using the Vertex AI Model Garden SDK.

Here’s a quick example of how to deploy the Llama 4 Scout Instruct model. First, you initialize an OpenModel instance with the associated model ID, which you can find in the Vertex AI Model Garden UI or via the list_deployable_models method. Then you start the deployment.

# pip install 'google-cloud-aiplatform>=1.84.0' 'openai' 'google-auth' 'requests'

import vertexai
from vertexai.preview import model_garden

# Initialize the SDK with your Google Cloud project and region.
vertexai.init(project="your-project-id", location="your-region")

# Reference the open model by its Model Garden ID, then deploy it to a
# Vertex AI endpoint (accepting Meta's EULA is required).
llama4_model = model_garden.OpenModel("meta/llama4@llama-4-scout-17b-16e-instruct")
llama4_endpoint = llama4_model.deploy(accept_eula=True)

By default, the deployment uses the recipe that Vertex AI Model Garden provides. You can review the recipe with the list_deploy_options method on your OpenModel instance, as shown below.
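
For instance, a minimal sketch of both inspection calls might look like the following; note that the model_filter argument is an assumption for illustration.

from vertexai.preview import model_garden

# List the open models that can be deployed from Model Garden.
for model_id in model_garden.list_deployable_models(model_filter="llama4"):
    print(model_id)

# Review the deployment recipes (machine type, accelerator, serving container)
# that Model Garden would use for this model.
scout = model_garden.OpenModel("meta/llama4@llama-4-scout-17b-16e-instruct")
for option in scout.list_deploy_options():
    print(option)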

After you start the deployment, you can monitor its progress on the Vertex AI Prediction page, as shown below.

Screenshot of model deployment in Vertex AI

Deploying the model in this case takes roughly 20 minutes. After the model is deployed, you can use either the Vertex AI API for Python or the Chat Completions API to start using Llama 4. Below is an example of how to use Llama 4 Scout for a simple image captioning task; a Vertex AI SDK sketch follows it.

import google.auth
import google.auth.transport.requests  # needed to construct the token-refresh request
import openai

# Authenticate with Application Default Credentials and mint an access token.
creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

# Replace these placeholders with your project, endpoint region, and endpoint ID.
PROJECT_ID = "your-project-id"
REGION = "your-endpoint-region"
ENDPOINT_ID = "your-endpoint-id"

ENDPOINT_RESOURCE_NAME = f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{ENDPOINT_ID}"

BASE_URL = f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"

# Point the OpenAI client at the Vertex AI endpoint, using the access token as the API key.
client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)

model_response = client.chat.completions.create(
    model="",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/The_Blue_Marble_%28remastered%29.jpg/580px-The_Blue_Marble_%28remastered%29.jpg"}},
                {"type": "text", "text": "What is in the image?"},
            ],
        }
    ],
    temperature=0,
    max_tokens=50,
)
print(model_response)
# The image presents a stunning visual representation of Earth, showcasing its diverse geography and atmospheric features. The planet's surface is predominantly blue, with swirling white clouds scattered across the oceans, while the landmasses are visible in shades of brown and gray, set against the inky blackness of space.
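
If you prefer the Vertex AI SDK over the OpenAI-compatible surface, a minimal sketch along the following lines should also work with the endpoint object returned by deploy() above. The instance payload keys (prompt, max_tokens) follow the vLLM-style serving convention commonly used by Model Garden containers and are an assumption here.

# Reuse the llama4_endpoint object returned by deploy() earlier.
response = llama4_endpoint.predict(
    instances=[{"prompt": "What is a Mixture-of-Experts model?", "max_tokens": 100}]
)
print(response.predictions[0])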

What’s next

To start using Llama 4 models on Vertex AI, explore the Llama 4 model cards in Vertex AI Model Garden and deploy the variant that fits your use case with the Model Garden SDK, as shown above.

Thanks for reading

Thank you for reading! I encourage you to connect and reach out on LinkedIn and X to share feedback, questions, and what you build on Vertex AI.

 
