Solved: Gemini 1.0 Pro Vision and function calling

alex-innervate · 04-02-2024 05:55 AM

I am trying to use the Vertex AI API to get structured information about an image.

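import {
	FunctionDeclarationSchemaType,
	GenerateContentRequest,
	VertexAI,
} from '@google-cloud/vertexai'

const vertexAI = new VertexAI({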
	project: env.GOOGLE_PROJECT_ID,
	location: env.GOOGLE_LOCATION,
})

const vertexModel = vertexAI.getGenerativeModel({
	model: 'gemini-1.0-pro-vision',
})

const request: GenerateContentRequest = {
	tools: [
		{
			function_declarations: [
				{
					name: 'analyze_image',
					description:
						`Analyze what is depicted in the picture. ` +
						`Create a short and long description. ` +
						`Determine the dominant colors.`,
					parameters: {
						type: FunctionDeclarationSchemaType.OBJECT,
						properties: {
							short: {
								type: FunctionDeclarationSchemaType.STRING,
								description: 'A short description of the image',
							},
							long: {
								type: FunctionDeclarationSchemaType.STRING,
								description: 'A long description of the image',
							},
						},
						required: ['short', 'long'],
					},
				},
			],
		},
	],
	contents: [
		{
			role: 'user',
			parts: [
				{
					inline_data: {
						data: '...', // base64 encoded image
						mime_type: 'image/png',
					},
				},
				{ text: 'Describe in detail what is depicted in the image' },
			],
		},
	],
}

vertexModel.generateContent(request)

It works fine WITHOUT function calling.

But when I try to declare a function, I always get an error:

 ClientError: [VertexAI.ClientError]: got status: 400 Bad Request. {"error":{"code":400,"message":"Unable to submit request because it contains input data that's not supported. The model supports text, text and up to 16 images, or text and 1 video. Learn more: https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/gemini","status":"INVALID_ARGUMENT"}}

Is there any way to use the Vision model with function calling?

Poala_Tenorio

Based on the error message you provided, it seems that the issue arises because the model you are using, gemini-1.0-pro-vision, doesn't support input data in the form of function calls. The model supports either text, text and up to 16 images, or text and 1 video.

The error message also suggests referring to the documentation for more information on supported input data types.

To resolve this issue, you may need to revise your approach or choose a different model that supports the kind of input you are trying to provide. Alternatively, you might consider reaching out to Google Cloud Support for further assistance or clarification regarding the capabilities of the Gemini 1.0 Pro Vision model and whether it supports function calling.

Giasin

Have you solved it?

alex-innervate

@Giasin, I worked around the original problem in two steps:

1. Obtain a text description of the image from the Gemini 1.0 Pro Vision model.
2. Use that text as the prompt for the Gemini 1.0 Pro model to get structured output through function calling.

It takes longer, but it works.
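
For anyone who wants to try the same workaround, here is a minimal sketch of the two-step flow, not a tested implementation against the live API. The names `SketchRequest`, `buildVisionRequest`, `buildStructuringRequest`, and `analyzeImage` are hypothetical, and the `describe` and `structure` callbacks are assumed to be your own wrappers around `generateContent` for `gemini-1.0-pro-vision` and `gemini-1.0-pro` respectively. The local types only mirror the request shape from the original post so the snippet stands alone.

```typescript
// Local stand-ins for the SDK request types (assumption: same shape as the
// request object in the original post).
type SketchPart =
	| { text: string }
	| { inline_data: { data: string; mime_type: string } }

interface SketchRequest {
	contents: { role: string; parts: SketchPart[] }[]
	tools?: unknown[]
}

// Step 1: image plus text, and no tools, which the vision model accepts.
function buildVisionRequest(imageBase64: string, mimeType: string): SketchRequest {
	return {
		contents: [
			{
				role: 'user',
				parts: [
					{ inline_data: { data: imageBase64, mime_type: mimeType } },
					{ text: 'Describe in detail what is depicted in the image' },
				],
			},
		],
	}
}

// Step 2: text only, plus the function declarations, for the text model.
function buildStructuringRequest(description: string, tools: unknown[]): SketchRequest {
	return {
		tools,
		contents: [{ role: 'user', parts: [{ text: description }] }],
	}
}

// Chains the two steps. `describe` and `structure` are hypothetical wrappers
// that call the vision model and the text model and unpack their responses.
async function analyzeImage(
	imageBase64: string,
	mimeType: string,
	tools: unknown[],
	describe: (request: SketchRequest) => Promise<string>,
	structure: (request: SketchRequest) => Promise<Record<string, unknown>>,
): Promise<Record<string, unknown>> {
	const description = await describe(buildVisionRequest(imageBase64, mimeType))
	return structure(buildStructuringRequest(description, tools))
}
```

The `tools` array from the original request can be passed through unchanged; the only difference is that it now travels with the text-only request instead of the image request.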

Giasin

Well, thank you!
However, is there a risk that this will distort the original input?
In my case, I want the stores affiliated with my project to photograph their menus.
Each menu contains data that must be processed and handled correctly. If I ask the Gemini 1.0 Pro Vision model to transcribe what it sees in the image and it misreads something, that error will carry over when I pass the transcription as input to the Gemini 1.0 Pro model.