Hi everyone,
I'm currently fine-tuning Gemini for structured output and have hit a snag regarding consistency between fine-tuning and inference. I'd appreciate some insight on the following:
Background:
I'm running supervised fine-tuning on a Gemini model so that it emits JSON conforming to a fixed schema. At inference time, structured output can be enforced via the response_schema parameter (controlled generation), but the fine-tuning dataset has no equivalent mechanism, so I embed the schema directly in my training prompts.
My Question:
How exactly does the response_schema parameter integrate the JSON schema into the prompt during inference? Is there any documentation, or a way to inspect the exact prompt that gets injected? Since I need to embed the JSON schema directly in my prompts during fine-tuning, I want to make sure those prompts are consistent with the inference-time behavior, where the schema is injected automatically. Any best practices or insights on aligning the two stages would be extremely helpful.
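For concreteness, this is the kind of inference-time call I'm referring to (a minimal sketch with the Vertex AI Python SDK; the project, model name, and schema are placeholders, not my actual setup):

```python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="my-project", location="us-central1")

# Hypothetical schema, for illustration only.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    "Extract the person's name and age: Alice is 34 years old.",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        # Controlled generation: the service enforces this schema,
        # but how it is woven into the prompt is opaque to me.
        response_schema=schema,
    ),
)
print(response.text)
```

What I can't see is how the service turns response_schema into prompt text, which is exactly what I'd need to mirror in my tuning data.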
Hi @gagrafio,
Welcome to Google Cloud Community!
As quoted from the documentation on Gemini model fine-tuning:
“Applying controlled generation when submitting inference requests to tuned Gemini models can result in decreased model quality due to data misalignment during tuning and inference time. During tuning, controlled generation isn't applied, so the tuned model isn't able to handle controlled generation well at inference time. Supervised fine-tuning effectively customizes the model to generate structured output. Therefore you don't need to apply controlled generation when making inference requests on tuned models.”
A response schema, i.e. controlled generation that specifies the structure of the model's output, is not advisable at inference time on a fine-tuned Gemini model. Inconsistency due to data misalignment between fine-tuning and inference is expected when you apply a response schema. During inference, the JSON schema is injected automatically by the system, and there is currently no documentation detailing how the response_schema parameter integrates the JSON schema into the prompt. Given that limited visibility, the only way to inspect or understand the injected schema is extensive experimentation.
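In practice, this means that once your model is tuned on schema-bearing examples, you call it without any controlled-generation settings. A minimal sketch (I'm assuming the Vertex AI Python SDK here; the tuned endpoint resource name is a placeholder):

```python
from vertexai.generative_models import GenerativeModel

# Reference the tuned model by its endpoint resource name (placeholder).
tuned_model = GenerativeModel(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

# Note: no response_mime_type or response_schema in the request. The
# tuned model already learned the output structure during supervised
# fine-tuning, so controlled generation is unnecessary here.
response = tuned_model.generate_content(
    "Extract the person's name and age: Alice is 34 years old."
)
print(response.text)
```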
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Thanks a lot, I had missed that section of the documentation; things make a lot more sense now. My plan moving forward is to manually include the schema structure in the prompt during fine-tuning, and then send the same schema-bearing prompt at inference time, without relying on controlled generation.
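In case it helps anyone else, here's roughly how I plan to build the tuning examples (a sketch; the schema and helper name are hypothetical, and the JSONL layout follows the Vertex AI supervised fine-tuning dataset format as I understand it):

```python
import json

# Hypothetical schema the tuned model should follow.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

def build_prompt(document: str) -> str:
    # Embed the schema verbatim in the prompt text, so the model sees
    # identical instructions during tuning and at inference time.
    return (
        "Reply with JSON matching this schema exactly:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\n"
        f"Document:\n{document}"
    )

# One training example per JSONL line, with "user" and "model" turns.
example = {
    "contents": [
        {"role": "user", "parts": [{"text": build_prompt("Alice is 34 years old.")}]},
        {"role": "model", "parts": [{"text": json.dumps({"name": "Alice", "age": 34})}]},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```

At inference I'll call the tuned model with build_prompt(...) and no schema in the generation config, so tuning and serving stay aligned.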