The title is self-explanatory I guess, but I will try to specify my problem a little bit further.
In my use case I am trying to use batching for an evaluation pipeline, since the output does not need to arrive in real time. Further, because my test set is very large, I run into rate limits of the regular API (and into higher costs as well).
Following the documentation, I can only specify the model and the input/output locations, like this:
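In essence the submission looks something like the snippet below (the project ID and gs:// paths are just placeholders):

import vertexai
from vertexai.batch_prediction import BatchPredictionJob

# Placeholders - replace with your own project and bucket paths
vertexai.init(project="your-project-id", location="us-central1")

batch_job = BatchPredictionJob.submit(
    source_model="gemini-1.5-pro-002",
    input_dataset="gs://your-bucket/batch_input.jsonl",
    output_uri_prefix="gs://your-bucket/batch_output/",
)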
Using any additional parameter - like generation_config in the regular API - throws errors. Function calling does not seem to be possible either, which could have served as a workaround, as it did for previous models. The documentation does not mention anything about this, nor can I find it discussed anywhere.
I also have to stress that I explicitly do not want to just validate the output afterwards (which is implemented for redundancy anyway), but to build this into the response generation step itself, making sure the evaluation pipeline is configured in the same way as the dev/production pipeline.
If this is not a current feature, how can batch predictions even be used sensibly (for anything beyond a small PoC), considering structured outputs are the only reliable way to make LLM outputs adhere to a specific format?
And as a side note: with OpenAI's API this is possible.
Hi @davidfeiz,
Welcome to Google Cloud Community!
It looks like you are trying to specify your generation configuration or use function calling with a Vertex AI BatchPredictionJob for Gemini models in order to get structured outputs, which is a common issue.
Here are potential ways that might help with your use case:
You may refer to the documentation below, which offers pertinent information on Google Cloud’s Batch Prediction Job, custom prediction routines, and structured output:
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Hi @MarvinLlamas,
Thank you for your reply!
Regarding your answer, I will specify a bit more how the prediction pipeline is set up:
As data we use freeform text stored as files in a GCS bucket; this represents our test set, and each file contains features that we want to extract. The test set is labelled and used to evaluate the output performance of the model (gemini-1.5-pro-002).
When using batch predictions, the input for each batch is structured as a JSONL file, in which we also define a custom_id to identify the input-output pairs. The prompt we pass contains not only the task but also already specifies the desired output. The third point you suggested is also already taken care of, since we validate the response (sanity checking) as well.
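A single request line in that JSONL is built roughly like this (the custom_id value and prompt text are just placeholders):

import json

# One batch request line: custom_id is our own field for matching
# outputs back to inputs, "request" follows the regular
# generateContent request body.
request_line = {
    "custom_id": "sample-001",
    "request": {
        "contents": [
            {
                "role": "user",
                "parts": [{"text": "Extract the features from: <document text>"}],
            }
        ],
    },
}

with open("batch_input.jsonl", "a") as f:
    f.write(json.dumps(request_line) + "\n")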
What we want to achieve is to have the format-restriction mechanism on the API level as well, to ensure the response itself is generated with restricted token sampling.
So, following the documentation on controlled generation for the regular API, we first need to translate any Pydantic model or schema into a JSON schema, which we can then pass to the model using the generation_config parameter:
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

response_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "recipe_name": {
                "type": "string",
            },
        },
        "required": ["recipe_name"],
    },
}

model = GenerativeModel("gemini-1.5-pro-002")

response = model.generate_content(
    "List a few popular cookie recipes",
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)
This has been copied from https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output
Trying to implement the same thing during batch creation for this example looks like this:
Maybe I am not passing the parameter correctly, or maybe it is not intended to be passed at all, but every way I tried resulted in errors. (I also tried the approach from the generativeai docs where a Pydantic BaseModel is passed inside a list.) Am I missing something here?
So I managed to solve this problem. The solution was to extend the request dict like so:
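In essence it looks like this (continuing the cookie-recipe schema from the docs example above; custom_id and prompt are placeholders):

import json

response_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {"recipe_name": {"type": "string"}},
        "required": ["recipe_name"],
    },
}

# The "generationConfig" entry (note the camelCase) inside "request"
# is what applies the controlled generation during batch prediction.
request_line = {
    "custom_id": "sample-001",
    "request": {
        "contents": [
            {"role": "user", "parts": [{"text": "List a few popular cookie recipes"}]}
        ],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": response_schema,
        },
    },
}

with open("batch_input.jsonl", "a") as f:
    f.write(json.dumps(request_line) + "\n")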
The accepted parameter is "generationConfig", which I just found by accident in one of the example notebooks provided in the generativeai repo 🙂
The different naming conventions for this parameter surely introduced some confusion here...
Could you link the example notebook you found this in? I am trying to solve a similar issue. Thanks!
Thanks for the code snippet - I've been looking everywhere for how to add an id to each line of a batch request!