
Streaming Response Support for Custom vLLM Container on Vertex AI?

Hey everyone,

I'm currently exploring Vertex AI for custom model deployment and trying to set up LLaMA 3 using vLLM. Here's what I’ve done so far:

  • Created a custom container with vLLM and the LLaMA 3 model.
  • Registered the model in the Vertex AI Model Registry.
  • Deployed the model to a Vertex AI Endpoint.

The setup works fine for standard (non-streaming) text generation. However, I’d really like to stream the output token by token, similar to how OpenAI/Anthropic APIs work.
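
For reference, here is roughly how I'm calling the deployed endpoint today for non-streaming generation (the project, region, endpoint ID, and instance fields below are placeholders; the actual instance schema depends on how my container's predict handler parses the request body):

```python
# Rough sketch of the working non-streaming call via the Vertex AI Python SDK.
# Project, region, endpoint ID, and the instance fields are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/your-project-id/locations/us-central1/endpoints/1234567890"
)

# Standard unary predict: one request in, one complete JSON response out.
response = endpoint.predict(
    instances=[{"prompt": "Tell me about trees.", "max_tokens": 128}]
)
print(response.predictions[0])  # full generation, no token-by-token streaming
```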

When I tried using my vLLM streaming endpoint, I got this error:

{
  "error": {
    "code": 400,
    "message": "The output data is not valid JSON. Original output (truncated):
      data: {\"text\": \" and\"}
      data: {\"text\": \" trees\"}
      ...
      data: {\"text\": \"-h\"}",
    "status": "FAILED_PRECONDITION"
  }
}

I’m trying to figure out if there’s a way for Vertex AI to handle text/event-stream responses directly from a custom container. So far, it looks like the predict endpoint expects a single valid JSON object, which doesn’t play well with vLLM’s streaming format. 
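
For context, here is a rough sketch of how I'd expect a client to consume the stream if it could reach the vLLM server directly: each `data:` line is a separate JSON object, so the response as a whole is never a single JSON document. The URL and route below are placeholders.

```python
import json
import requests

# Placeholder URL/route -- this assumes direct HTTP access to the vLLM server,
# which the Vertex AI predict endpoint does not provide.
with requests.post(
    "https://your-vllm-host.example.com/generate",
    json={"prompt": "Tell me about trees.", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        # Each SSE event arrives as its own "data: {...}" line,
        # separated by blank keep-alive lines.
        if not line or not line.startswith("data:"):
            continue
        chunk = json.loads(line[len("data:"):].strip())
        print(chunk["text"], end="", flush=True)
```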

My questions:

  • Is there a way to configure Vertex AI endpoints to handle text/event-stream responses from custom containers?
  • Has anyone successfully implemented streaming with vLLM on Vertex AI?

Any tips, documentation references, or example implementations would be greatly appreciated. Thanks in advance!

1 REPLY

Hi @aslam_regobs,

Welcome to the Google Cloud Community!

It looks like you are encountering an incompatibility between the text/event-stream format used by your vLLM container for token-by-token streaming and the Vertex AI predict endpoint, which expects a single JSON object. This mismatch causes parsing failures when Vertex AI encounters multi-line SSE data (`data: {...}\n\n`).

Here are the potential ways that might help with your use case:

  • Deploy vLLM on Cloud Run: Cloud Run natively supports streaming text/event-stream responses, so your custom vLLM container can stream tokens directly to clients. It also offers serverless scaling with GPU support, making it a cost-effective way to sidestep the constraints of the Vertex AI predict endpoint (see the sketch after this list).
  • Deploy to Google Kubernetes Engine (GKE): If you need full control, custom scaling, and maximum throughput, GKE gives you solid GPU support and plenty of flexibility, but just a heads-up, it comes with more operational complexity.
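
For illustration, here is a minimal sketch of how a client could consume token-by-token output from a vLLM container hosted on Cloud Run. It assumes the container runs vLLM's OpenAI-compatible server and that the Cloud Run service requires IAM authentication; the service URL and model name are placeholders.

```python
import google.auth.transport.requests
import google.oauth2.id_token
from openai import OpenAI

SERVICE_URL = "https://your-vllm-service-xxxxx-uc.a.run.app"  # placeholder

# Mint an ID token for the Cloud Run service (skip this if the service
# allows unauthenticated access) and pass it as the bearer token.
token = google.oauth2.id_token.fetch_id_token(
    google.auth.transport.requests.Request(), SERVICE_URL
)

# vLLM's OpenAI-compatible server exposes /v1/chat/completions with SSE streaming.
client = OpenAI(base_url=f"{SERVICE_URL}/v1", api_key=token)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Tell me about trees."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Because Cloud Run passes the HTTP response through as-is, the SSE chunks reach the client without the single-JSON validation that the Vertex AI predict route applies.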

To answer your questions directly:

  • Is there a way to configure Vertex AI endpoints to handle text/event-stream responses from custom containers? No. The standard `predict` method of a Vertex AI endpoint is designed for request-response interactions and requires a single, complete JSON payload. It does not support streaming responses via SSE or WebSockets, and it does not act as a proxy for such streams from a custom container.
  • Has anyone successfully implemented streaming with vLLM on Vertex AI? Yes, though not through the standard Vertex AI endpoint's `predict` interface. A typical solution is to host the vLLM HTTP server, which supports streaming natively, on a platform that exposes it directly over HTTP/SSE, such as Cloud Run or GKE.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.