Solved: Gemini 1.0 Pro Vision and function calling

alex-innervate · 04-02-2024 05:55 AM

I am trying to use the Vertex AI API to get structured information about an image.

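import { VertexAI, FunctionDeclarationSchemaType, GenerateContentRequest } from '@google-cloud/vertexai'

const vertexAI = new VertexAI({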
	project: env.GOOGLE_PROJECT_ID,
	location: env.GOOGLE_LOCATION,
})

const vertexModel = vertexAI.getGenerativeModel({
	model: 'gemini-1.0-pro-vision',
})

const request: GenerateContentRequest = {
	tools: [
		{
			function_declarations: [
				{
					name: 'analyze_image',
					description:
						`Analyze what is depicted in the picture. ` +
						`Create a short and long description. ` +
						`Determine the dominant colors.`,
					parameters: {
						type: FunctionDeclarationSchemaType.OBJECT,
						properties: {
							short: {
								type: FunctionDeclarationSchemaType.STRING,
								description: 'A short description of the image',
							},
							long: {
								type: FunctionDeclarationSchemaType.STRING,
								description: 'A long description of the image',
							},
						},
						required: ['short', 'long'],
					},
				},
			],
		},
	],
	contents: [
		{
			role: 'user',
			parts: [
				{
					inline_data: {
						data: '...', // base64 encoded image
						mime_type: 'image/png',
					},
				},
				{ text: 'Describe in detail what is depicted in the image' },
			],
		},
	],
}

vertexModel.generateContent(request)

It works fine WITHOUT function calling.

But when I try to declare a function, I always get an error:

 ClientError: [VertexAI.ClientError]: got status: 400 Bad Request. {"error":{"code":400,"message":"Unable to submit request because it contains input data that's not supported. The model supports text, text and up to 16 images, or text and 1 video. Learn more: https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/gemini","status":"INVALID_ARGUMENT"}}

Is there any way to use the Vision model with function calls?

Poala_Tenorio

Based on the error message you provided, it seems that the issue arises because the model you are using, gemini-1.0-pro-vision, doesn't support input data in the form of function calls. The model supports either text, text and up to 16 images, or text and 1 video.

The error message also suggests referring to the documentation for more information on supported input data types.

To resolve this issue, you may need to revise your approach or choose a different model that supports the kind of input you are trying to provide. Alternatively, you can contact Google Cloud Support to clarify whether the gemini-1.0-pro-vision model supports function declarations alongside image input.
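
As one possible way to "revise your approach", here is a minimal sketch that assumes you drop `tools` from the request and instead ask the model, in the text part of the prompt, to reply with a JSON object. The names `ImageAnalysis`, `JSON_PROMPT` and `parseImageAnalysis` are hypothetical and not part of the Vertex AI SDK; only the `short` and `long` fields come from the question. The helper does not call the API, it only parses the text the model returns, and it is not guaranteed that the model will always reply with valid JSON.

```typescript
// Hypothetical result shape, mirroring the `short` and `long` parameters from the question.
interface ImageAnalysis {
	short: string
	long: string
}

// Hypothetical prompt text (an assumption): send it as the `text` part next to the
// image, in a request that has no `tools` field.
const JSON_PROMPT =
	'Describe what is depicted in the image. ' +
	'Reply with only a JSON object with two string fields: ' +
	'"short" (a short description) and "long" (a long description).'

// Extract and validate the JSON object from the model's text reply.
// The reply may contain extra prose or a Markdown fence around the JSON,
// so only the text between the first "{" and the last "}" is parsed.
function parseImageAnalysis(text: string): ImageAnalysis {
	const start = text.indexOf('{')
	const end = text.lastIndexOf('}')
	if (start === -1 || end <= start) {
		throw new Error('No JSON object found in model reply')
	}
	const parsed: unknown = JSON.parse(text.slice(start, end + 1))
	if (typeof parsed !== 'object' || parsed === null) {
		throw new Error('Model reply is not a JSON object')
	}
	const record = parsed as Record<string, unknown>
	const shortText = record.short
	const longText = record.long
	if (typeof shortText !== 'string' || typeof longText !== 'string') {
		throw new Error('Model reply is missing "short" or "long"')
	}
	return { short: shortText, long: longText }
}
```

You would pass the text of the first candidate in the response to `parseImageAnalysis` and handle the thrown error, for example by retrying the request.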
