The title is self-explanatory I guess, but I will try to specify my problem a little bit further.
In my use case I am trying to use batching for an evaluation pipeline, since the output does not need to arrive in real time. Also, because my test set is very large, I hit the rate limits of the regular API (and incur higher costs as well).
Following the documentation, I can only specify the model and input/output locations like this
Using any additional parameter, such as generation_config in the regular API, throws errors. Function calling does not seem to be possible either, which could have served as a workaround, as it did for previous models. The documentation does not mention anything about this, nor can I find it discussed anywhere.
I also have to stress that I explicitly do not want to just validate my output afterwards (which is implemented for redundancy), but to implement this into the response generation step to begin with, making sure the evaluation pipeline is configured in the same way as the dev/production pipeline.
If this is not a current feature, how can batch predictions even be used sensibly (for anything beyond a small PoC), considering structured outputs are the only reliable way to make LLM outputs adhere to a specific format?
And as a side note: with OpenAI's API this is possible.
Hi @davidfeiz,
Welcome to Google Cloud Community!
It looks like you are trying to specify a generation configuration or use function calling with a Vertex AI BatchPredictionJob for Gemini models in order to enforce structured outputs, which is a common issue.
Here are potential ways that might help with your use case:
You may refer to the documentation below, which offers pertinent information on Google Cloud’s Batch Prediction Job, custom prediction routines, and structured output:
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Hi @MarvinLlamas,
Thank you for your reply!
Regarding your answer, I will specify a bit more how the prediction pipeline is set up:
Our data is free-form text stored as files in a GCP bucket. These files make up our test set, and each one contains features that we want to extract. The test set is labelled and used to evaluate the output performance of the model (gemini-1.5-pro-002).
When using Batch predictions the input is structured as a JSONL for each batch, in which we also define a custom_id to identify the input-output pairs. The prompt we pass contains not only the task but also already specifies the desired output. The third point you suggested is also already taken care of, since we validate the response (sanity checking) as well.
What we want to achieve is to have the format-restriction mechanism also on the API level, to ensure the response itself is generated with a "restricted token sampling".
So, following the documentation on controlled generation for the regular API, we first need to translate any pydantic model or schema into JSON and can then pass it to the model using the generation_config parameter:
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

response_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "recipe_name": {
                "type": "string",
            },
        },
        "required": ["recipe_name"],
    },
}

model = GenerativeModel("gemini-1.5-pro-002")

response = model.generate_content(
    "List a few popular cookie recipes",
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)
This has been copied from https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output
Trying to implement the same during the batch creation for this example looks like this:
Maybe I am not passing the parameter correctly, or maybe it is not intended to be passed at all, but every way I tried resulted in errors. (I also tried the approach from the generativeai docs where a pydantic base model is passed inside a list.) Am I missing something here?
So I managed to solve this problem. The solution was to extend the request dict like so
The accepted parameter is "generationConfig", which I just found by accident in one of the example notebooks provided in the generativeai repo 🙂
The different naming conventions for this parameter surely introduced some confusion here...
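For readers who land here without the snippet, this is a minimal sketch of how one line of the batch input JSONL might be built with "generationConfig" added. It assumes the camelCase field names responseMimeType and responseSchema (the REST-style counterparts of response_mime_type and response_schema) and a top-level "request" key, as seen in the batch output records. The helper name build_batch_line is hypothetical, and the schema is the cookie-recipe one from the documentation example.

```python
import json

# Schema taken from the cookie-recipe example in the controlled generation docs.
response_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {"recipe_name": {"type": "string"}},
        "required": ["recipe_name"],
    },
}


def build_batch_line(prompt: str, schema: dict) -> str:
    """Return one JSONL line for a batch request (hypothetical helper)."""
    request = {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        # Assumption: batch requests use the camelCase key "generationConfig"
        # with camelCase fields, unlike the SDK's generation_config.
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": schema,
        },
    }
    # json.dumps emits no raw newlines, so each request stays on one line.
    return json.dumps({"request": request})


line = build_batch_line("List a few popular cookie recipes", response_schema)
print(line)
```

Writing one such line per test file to a .jsonl file in the bucket would give the batch job the same schema constraint as the regular API call, provided the assumed field names are accepted.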
Could you link the example notebook you found this in? I am trying to solve a similar issue. Thanks!
Thanks for the code snippet - I've been looking everywhere for how to add an id to each line of a batch request!
I am facing a similar issue. To generate structured output we have to pass the output schema, but I don't know how to pass it for batch prediction. Google's documentation on this is hell. Can you please share the reference or notebook for controlled generation of structured output?
{"status":"","processed_time":"2025-05-21T07:53:38.797+00:00","request":{"contents":[{"parts":[{"file_data":null,"text":"create the metadata from this file."},{"file_data":{"file_uri":"gs://foley-sound-bcc-sound-effect-wav/BBCSoundEffectsWavs/00008000.wav","mime_type":"audio/wav"},"text":null}],"role":"user"}],"generationConfig":{"temperature":0.4}},"response":{"candidates":[{"avgLogprobs":-0.3753753984478158,"content":{"parts":[{"text":"Here's a possible set of metadata for the audio file, based on the content:\n\n**General Metadata:**\n\n* **Description:** Audio recording of children playing and talking, possibly in an outdoor environment.\n* **Keywords:** Children, play, voices, outdoor, speech, noise, activity.\n\n**Technical Metadata:**\n\n* **File Format:** (Assuming it's a common format like) .mp3, .wav, .aac\n* **Duration:** (Needs to be determined by analyzing the file length)\n* **Channels:** Mono or Stereo (likely mono if recorded with a single microphone)\n* **Sample Rate:** (Needs to be determined by analyzing the file)\n* **Bit Rate:** (Needs to be determined by analyzing the file)\n\n**Contextual Metadata (If available):**\n\n* **Location:** (If known, where the recording was made)\n* **Date and Time:** (If known, when the recording was made)\n* **Recorder:** (If known, who made the recording)\n* **Purpose:** (Why the recording was made, if known)\n\n**Notes:**\n\n* The metadata above is based on the audio content and common practices.\n* To get accurate technical metadata, you need to analyze the audio file using audio editing software or metadata analysis tools."}],"role":"model"},"finishReason":"STOP"}],"createTime":"2025-05-21T07:53:38.869209Z","modelVersion":"gemini-2.0-flash-001@default","responseId":"goYtaNmGNZizgLUPp93-2A4","usageMetadata":{"candidatesTokenCount":284,"candidatesTokensDetails":[{"modality":"TEXT","tokenCount":284}],"promptTokenCount":732,"promptTokensDetails":[{"modality":"AUDIO","tokenCount":725},{"modality":"TEXT","tokenCount":7}],"totalTokenCount":1016,"trafficType":"ON_DEMAND"}}}
{"status":"Bad Request: {\"error\": {\"code\": 400, \"message\": \"Request contains an invalid argument.\", \"status\": \"INVALID_ARGUMENT\"}}","processed_time":"2025-05-21T07:53:38.799+00:00","request":{"contents":[{"parts":[{"file_data":null,"text":"create the metadata from this file."},{"file_data":{"file_uri":"gs://foley-sound-bcc-sound-effect-wav/BBCSoundEffectsWavs/00008001.wav","mime_type":"audio/wav"},"text":null}],"role":"user"}],"generationConfig":{"temperature":0.4}},"response":{}}
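The output records above differ in their "status" field: it is empty for the request that succeeded and carries the error message for the one that failed. Here is a minimal sketch for splitting a batch output file on that field, assuming every record has the same top-level keys as the ones shown. The function name split_batch_results and the shortened sample records are hypothetical.

```python
import json


def split_batch_results(lines):
    """Separate batch output records into successes and failures.

    Assumption: a record counts as failed when its "status" field is
    a non-empty string, as in the records shown in this thread.
    """
    ok, failed = [], []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        record = json.loads(raw)
        if record.get("status"):
            failed.append(record)
        else:
            ok.append(record)
    return ok, failed


# Shortened, made-up records that mirror the structure of the real output.
sample = [
    json.dumps({
        "status": "",
        "request": {"generationConfig": {"temperature": 0.4}},
        "response": {"candidates": [{"finishReason": "STOP"}]},
    }),
    json.dumps({
        "status": "Bad Request: invalid argument",
        "request": {"generationConfig": {"temperature": 0.4}},
        "response": {},
    }),
]

ok, failed = split_batch_results(sample)
print(len(ok), "succeeded;", len(failed), "failed")
```

The failed records keep their original "request" payload, so they can be inspected or resubmitted individually.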
I think this is the notebook: https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/...