I want to use a multimodal model such as Gemini to analyze certain characteristics of images. On the Vertex AI demo page, I can insert multiple images combined with text to generate a text response. Is there a way to fine-tune a model on text and image inputs so that it generates text output?