Hey everyone,
I'm currently exploring Vertex AI for custom model deployment and trying to set up LLaMA 3 using vLLM. So far, I've created a custom container with vLLM and the LLaMA 3 model and deployed it to a Vertex AI endpoint.
The setup works fine for standard (non-streaming) text generation. However, I’d really like to stream the output token by token, similar to how OpenAI/Anthropic APIs work.
When I tried using my vLLM streaming endpoint, I got this error:
{
  "error": {
    "code": 400,
    "message": "The output data is not valid JSON. Original output (truncated):
      data: {\"text\": \" and\"}
      data: {\"text\": \" trees\"}
      ...
      data: {\"text\": \"-h\"}",
    "status": "FAILED_PRECONDITION"
  }
}
I’m trying to figure out if there’s a way for Vertex AI to handle text/event-stream responses directly from a custom container. So far, it looks like the predict endpoint expects a single valid JSON object, which doesn’t play well with vLLM’s streaming format.
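To give a concrete picture, here's a stripped-down sketch of the kind of streaming route my container exposes (illustrative only; the actual route path, model name, and request fields differ a bit):

```python
# Stripped-down sketch of an SSE streaming route on top of vLLM's AsyncLLMEngine.
# Route path, model name, and request fields are illustrative placeholders.
import json
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Meta-Llama-3-8B-Instruct"))


@app.post("/generate_stream")
async def generate_stream(request: Request):
    body = await request.json()
    prompt = body["prompt"]
    params = SamplingParams(max_tokens=body.get("max_tokens", 256))

    async def sse():
        emitted = ""
        # engine.generate yields RequestOutput objects with cumulative text.
        async for out in engine.generate(prompt, params, uuid.uuid4().hex):
            text = out.outputs[0].text
            delta, emitted = text[len(emitted):], text
            # One "data: {...}" line per newly generated chunk, SSE style.
            yield f"data: {json.dumps({'text': delta})}\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")
```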
My questions:

1. Does Vertex AI support streaming (text/event-stream or chunked) responses from a custom container, or does the predict endpoint always require a single JSON object?
2. If streaming is supported, how should the container and the endpoint call be set up to enable it?

Any tips, documentation references, or example implementations would be greatly appreciated. Thanks in advance!
Hi @aslam_regobs,
Welcome to the Google Cloud Community!
It looks like you are encountering an incompatibility between the text/event-stream format used by your vLLM container for token-by-token streaming and the Vertex AI predict endpoint, which expects a single JSON object. This mismatch causes parsing failures when Vertex AI encounters multi-line SSE data (`data: {...}\n\n`).
Here are some approaches that might help with your use case:

1. Keep the predict route non-streaming: have the container drain vLLM's token stream and return a single JSON object in the `{"predictions": [...]}` shape the predict method expects (a sketch follows right after this list).
2. Use the endpoint's streamRawPredict method instead of predict for token-by-token output; it streams the container's response back rather than parsing it as one JSON object (see the client-side sketch further below).
3. If you need full control over SSE semantics, serving the container on Cloud Run or GKE is another option, since you then own the entire HTTP response.
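For the first option, here's a minimal sketch, assuming a FastAPI server in front of vLLM's AsyncLLMEngine (route path, model name, and request fields are illustrative placeholders):

```python
# Non-streaming /predict route: drain the token stream, then return one JSON
# object in the {"predictions": [...]} shape Vertex AI's predict method expects.
import uuid

from fastapi import FastAPI, Request
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Meta-Llama-3-8B-Instruct"))  # placeholder


@app.post("/predict")
async def predict(request: Request):
    body = await request.json()
    predictions = []
    for instance in body.get("instances", []):
        params = SamplingParams(max_tokens=instance.get("max_tokens", 256))
        final = None
        # Drain the async generator; the last RequestOutput holds the full text.
        async for out in engine.generate(instance["prompt"], params,
                                         uuid.uuid4().hex):
            final = out
        predictions.append({"text": final.outputs[0].text})
    return {"predictions": predictions}
```

This keeps the predict endpoint happy, but you give up token-by-token output on that route.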
In addition, to answer your questions: the predict method itself expects a single JSON object from the container, so there is no configuration that makes it accept text/event-stream output as-is; for token-by-token streaming from a custom container, the streamRawPredict method (or an alternative platform such as Cloud Run or GKE, where you own the full HTTP response) is the path to look at.
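Here is a rough client-side sketch of consuming such a stream. It assumes streamRawPredict is available for your endpoint, that the request body is forwarded to your container unchanged, and that the container emits the same `data: {...}` SSE lines shown in your error; the project, location, endpoint ID, and payload fields are placeholders, so please verify the details against the current Vertex AI REST reference:

```python
# Client-side sketch: read SSE chunks from an endpoint via streamRawPredict.
# All identifiers below are placeholders; verify the method and routing
# behaviour for your setup in the current Vertex AI documentation.
import json

import google.auth
import google.auth.transport.requests
import requests

PROJECT = "my-project"       # placeholder
LOCATION = "us-central1"     # placeholder
ENDPOINT_ID = "1234567890"   # placeholder

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(google.auth.transport.requests.Request())

url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{LOCATION}/endpoints/{ENDPOINT_ID}:streamRawPredict"
)

response = requests.post(
    url,
    headers={
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json",
    },
    json={"prompt": "Tell me about trees", "max_tokens": 128, "stream": True},
    stream=True,  # keep the connection open and read chunks as they arrive
)
response.raise_for_status()

# Each non-empty line is an SSE "data: {...}" chunk produced by the container.
for line in response.iter_lines():
    if line and line.startswith(b"data: "):
        token = json.loads(line[len(b"data: "):])["text"]
        print(token, end="", flush=True)
```

If your container serves everything from its predict route, it will need some way to decide when to stream; the "stream": true field in the payload above is just one illustrative way to signal that.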
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.