
Structured Output in Vertex AI BatchPredictionJob

The title is self-explanatory I guess, but I will try to specify my problem a little bit further.

In my use case I am trying to use batching for an evaluation pipeline, since the output is not required in real time. Further, because my test set is very large, I run into rate limits of the regular API (and into higher costs as well).
Following the documentation, I can only specify the model and the input/output locations, like this:

[Screenshot: BatchPredictionJob creation specifying only the model and the input/output locations]
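For reference, a minimal sketch of what this looks like, based on the documented vertexai.batch_prediction interface (project ID and bucket paths are placeholders):

import vertexai
from vertexai.batch_prediction import BatchPredictionJob

# Placeholders - replace with your own project and bucket
vertexai.init(project="your-project-id", location="us-central1")

# Only the model and the input/output locations can be passed here
job = BatchPredictionJob.submit(
    source_model="gemini-1.5-pro-002",
    input_dataset="gs://your-bucket/batch_input.jsonl",
    output_uri_prefix="gs://your-bucket/batch_output/",
)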
Using any additional parameter - like generation_config in the regular API - throws errors. Function calling also does not seem to be possible, which could have served as a workaround, as it did for previous models. The documentation does not mention anything about this, nor do I find it discussed anywhere.
I also have to stress that I explicitly do not want to just validate my output afterwards (which is implemented for redundancy), but to build this into the response generation step to begin with, making sure the evaluation pipeline is configured in the same way as the dev/production pipeline.

If this is not a current feature, how can batch predictions even be used sensibly (for anything beyond a small PoC), considering structured outputs are the only reliable way to make LLM outputs adhere to a specific format?

And as a side note: with OpenAI's API this is possible.


Hi @davidfeiz,

Welcome to Google Cloud Community!

It looks like you are trying to specify a generation configuration or use function calling with a Vertex AI BatchPredictionJob for Gemini models to ensure structured outputs, which is a common issue.

Here are potential ways that might help with your use case:

  • Experiment with input format: You may try different data formats (such as JSONL) and organize your data in ways that might enable you to pass parameters; see the sketch after this list for the basic JSONL layout.
  • Prompt Engineering: Make sure you precisely construct your prompts to direct your LLM towards the desired structured output. This method depends significantly on the model's ability to consistently understand your instructions.
  • Validation and Post-processing: You may want to process your raw outputs to extract and structure the relevant information after your batch prediction job completes. This method adds complexity and computational overhead, and increases the risk of errors compared to a more controlled process.
  • Consider Custom Prediction Routine: You may want to explore using a custom prediction routine. By creating a custom container with your prediction code, you'll gain more control over how your model is called and how your output is formatted.
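As an illustration of the JSONL input format for Gemini batch prediction (a minimal sketch following the documentation; the prompt text is a placeholder):

import json

# Each line of the input file is a standalone JSON object with a "request" body
request_line = {
    "request": {
        "contents": [
            {"role": "user", "parts": [{"text": "Extract the required features from this document ..."}]}
        ]
    }
}

with open("batch_input.jsonl", "w") as f:
    f.write(json.dumps(request_line) + "\n")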

You may refer to the Google Cloud documentation on batch prediction jobs, custom prediction routines, and structured output, which offers pertinent information on each of these topics.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.


Hi @MarvinLlamas,

Thank you for your reply!

Regarding your answer, I will specify a bit more how the prediction pipeline is set up:
As data, we use freeform text stored as files in a GCP bucket; this represents our test set, and each file contains features that we want to extract. The test set is labelled and used to evaluate the output performance of the model (gemini-1.5-pro-002).
When using batch predictions, the input for each batch is structured as a JSONL file, in which we also define a custom_id to identify the input-output pairs. The prompt we pass contains not only the task but also already specifies the desired output. The third point you suggested is also already taken care of, since we validate (sanity check) the responses as well.
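For context, one line of our batch input JSONL is built roughly like this (a simplified sketch; the actual prompt and id scheme differ):

import json

# "custom_id" is our own bookkeeping key used to match inputs to outputs;
# the prompt contains the task plus the desired output specification
request_line = {
    "custom_id": "testset-file-0001",
    "request": {
        "contents": [
            {"role": "user", "parts": [{"text": "Task description ... desired output format ... <document text>"}]}
        ]
    },
}

print(json.dumps(request_line))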

What we want to achieve is to have the format-restriction mechanism also at the API level, to ensure the response itself is generated with restricted token sampling.
So, following the documentation on controlled generation for the regular API, we need to translate any pydantic model or schema into a JSON schema first, and can then pass it to the model using the generation_config parameter:

import vertexai

from vertexai.generative_models import GenerationConfig, GenerativeModel

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

response_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "recipe_name": {
                "type": "string",
            },
        },
        "required": ["recipe_name"],
    },
}

model = GenerativeModel("gemini-1.5-pro-002")

response = model.generate_content(
    "List a few popular cookie recipes",
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)

This has been copied from https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output
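As a side note on the "translate any pydantic model into a JSON schema" step: with pydantic v2 this can be sketched via model_json_schema(), although the generated schema may contain extra keys (such as "title") that you might need to strip, depending on what Vertex AI accepts:

from pydantic import BaseModel

class Recipe(BaseModel):
    recipe_name: str

# pydantic v2 emits a JSON schema dict; extra keys like "title" may need trimming
# before it is passed as response_schema
response_schema = {
    "type": "array",
    "items": Recipe.model_json_schema(),
}

print(response_schema)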

Trying to implement the same thing during batch creation for this example looks like this:

[Screenshot: attempt to pass generation_config during batch job creation]

Maybe I am not passing the parameter correctly, or maybe it is not intended to be passed at all, but every way I tried resulted in errors. (I also tried the approach from the generativeai docs where a pydantic base model is passed inside a list.) Am I missing something here?

So I managed to solve this problem. The solution was to extend the request dict like so:

[Screenshot: request dict extended with a "generationConfig" entry]
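Concretely, each request dict gets a camelCase "generationConfig" block; a rough sketch (field names follow the camelCase REST format, schema reused from the cookie-recipe example above):

import json

response_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {"recipe_name": {"type": "string"}},
        "required": ["recipe_name"],
    },
}

# Note the camelCase keys: "generationConfig", "responseMimeType", "responseSchema"
request_line = {
    "custom_id": "example-0001",
    "request": {
        "contents": [{"role": "user", "parts": [{"text": "List a few popular cookie recipes"}]}],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": response_schema,
        },
    },
}

print(json.dumps(request_line))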

The accepted parameter is "generationConfig", which I found by accident in one of the example notebooks provided in the generativeai repo 🙂
The different naming conventions for this parameter (snake_case generation_config in the SDK vs. camelCase generationConfig in the request JSON) surely introduced some confusion here...

Could you link the example notebook you found this in? I am trying to solve a similar issue. Thanks!

Thanks for the code snippet - I've been looking everywhere for how to add an id to each line of a batch request!