Hi,
I'm using Vertex AI in Firebase, mostly with the gemini-2.0-flash model (although I've noticed the same behavior described below with gemini-2.0-flash-lite). When I set maxOutputTokens to something like 100, the response gets truncated, often in the middle of a sentence or a paragraph.
My understanding was that a lower setting of maxOutputTokens could be used to keep the model's response relatively brief. But it doesn't seem to actually have any effect on the text that is being generated; instead, the model (or the API) just doesn't seem to be returning all of the generated text.
My code (in React Native) looks something like this:
import { getApp } from 'firebase/app';
import { getVertexAI, getGenerativeModel } from 'firebase/vertexai';

// Generation settings shared by the model and the chat session
const genConf = {
  temperature: 0.5,
  maxOutputTokens: 100,  // hard cap on returned tokens
  topP: 0.8,
  topK: 40
};

const firebaseApp = getApp();
const vertexAI = getVertexAI(firebaseApp);

const mdlPrms = {
  model: 'gemini-2.0-flash',
  generationConfig: genConf
};
const generativeModel = getGenerativeModel(vertexAI, mdlPrms);

const chatPrms = {
  generationConfig: genConf
};
const chat = generativeModel.startChat(chatPrms);

const chatSubmission = `What is the origin of the universe?`;
const result = await chat.sendMessageStream(chatSubmission);

// Consume the streamed chunks as they arrive
for await (const chunk of result.stream) {
  console.log(chunk.text());
}
Through trial and error, I've found that including maxOutputTokens in the params passed to getGenerativeModel() doesn't seem to have any effect at all, but including it in the params passed to generativeModel.startChat() definitely does. It's just not the effect I was expecting.
Am I misunderstanding how maxOutputTokens is supposed to work? Is there some other way to keep responses brief? Maybe using systemInstruction?
Thanks!
Hi @TheRealMikeD,
Welcome to the Google Cloud Community!
It looks like you're running into responses from the Gemini API being truncated when you use maxOutputTokens to limit their length: you want brief outputs, but the API cuts the text off mid-sentence or mid-paragraph, leaving the response incomplete.

Here are a couple of things that might help with your use case:

- maxOutputTokens is a hard cap on how many tokens the API will return, not a style control. The model doesn't plan its answer around the limit; when the limit is reached, generation simply stops, which is why you see text cut off mid-sentence.
- To keep answers genuinely brief, ask for brevity in the prompt itself or in a systemInstruction (for example, "Answer in two or three sentences"), and keep maxOutputTokens as a safety cap set comfortably above the length you actually expect.
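For example, something along these lines (a minimal sketch; the instruction wording and the 300-token cap are just illustrative values, not required settings):

// Sketch: ask for brevity via systemInstruction; keep maxOutputTokens only as
// a hard safety cap with headroom so normal answers are never cut off.
const generativeModel = getGenerativeModel(vertexAI, {
  model: 'gemini-2.0-flash',
  systemInstruction: 'Answer in at most three short sentences.',
  generationConfig: {
    temperature: 0.5,
    maxOutputTokens: 300  // upper bound, not a target length
  }
});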
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Hi @MarvinLlamas. Thanks for the reply.
That's somewhat helpful. It does seem like systemInstruction might be my best choice for keeping responses shorter. But does that mean that maxOutputTokens doesn't really change how the model generates the response? If that's true, then I'm a little puzzled as to why the maxOutputTokens setting exists at all. It doesn't seem like cutting the response off in the middle of a sentence would ever be useful.
The prompts are entered by the users of my app, so I don't have much control over them. I suppose I could impose a maximum character count.
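In the meantime, I think I can at least detect when a response was actually cut off: after the stream finishes, the aggregated response includes a finishReason, and from what I can see in the SDK it's 'MAX_TOKENS' when the cap was hit. Rough sketch of what I mean:

// Sketch: after streaming completes, check why generation stopped
// (assumes the aggregated response shape from the Firebase Vertex AI SDK).
const response = await result.response;
const finishReason = response.candidates?.[0]?.finishReason;
if (finishReason === 'MAX_TOKENS') {
  // maxOutputTokens was hit, so the text ended abruptly rather than wrapping up.
  console.warn('Response was truncated by maxOutputTokens');
}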
Thanks.