Fine-Tuning with Images and Text?!

Capybara112 · 08-19-2024 10:54 PM

I want to use a multimodal model such as Gemini to analyze certain characteristics of images. In Vertex AI demo page, I can insert multiple images combined with text to generate a text response. Is there a way to fine-tune a model on text and images to generate a text output?

dawnberdan

Hi @Capybara112,

Welcome to Google Cloud Community!

Yes! You have the option to fine-tune a multimodal model to generate text outputs by utilizing both text and images. To do this, you will need to compile a dataset that contains paired images and text. You can use a platform such as Vertex AI to train the model. During this process, you will need to specify your training objectives and adjust certain parameters to align the model with your specific requirements.

For further information regarding the fine-tuning process, you may refer to this document.

I hope the above information is helpful.