Fine-tuning Gemini Model with x2 images and text as input

TR-WW · 10-14-2024 09:55 PM

Hi there,

I'm trying to create a fine-tuned Gemini model that takes 2 images and text as input and returns text as output. Essentially this is object detection, e.g. 'is image 1 found in image 2?'

According to documentation within 'Generative AI on Vertex AI > Documentation'...

"Visual inspection: Train a model to identify specific objects or defects within images, automating quality control or inspection processes."
- https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune_gemini/image_tune

However, the example training dataset only includes one image in the parts for the user role:

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "mimeType": "image/jpeg",
            "fileUri": "gs://cloud-samples-data/ai-platform/generative_ai/gemini-1_5/image/longcap100/100.jpeg"
          }
        },
        {
          "text": "Describe this image in detail that captures the essence of it."
        }
      ]
    }, 
    {
      "role": "model",
      "parts": [
        {
          "text": "A man stands on a road, wearing a blue denim jacket, tan pants, and white sneakers. He has his hands in his pockets and is wearing a white t-shirt under his jacket. The man's pants are cuffed, and his shoes are white. The road is dark grey, and the leaves are green. The man is standing in the shade, and the light is shining on the ground."
        }
      ]
    }
  ]
}

- https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune_gemini/image_tune#dataset_format
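
Since `parts` is a list, one thing to try is placing two `fileData` entries ahead of the text part in a single user turn. The sketch below builds one such record as a JSONL line. It assumes the tuning service accepts more than one image part per example, which the sample above does not show; the helper name `make_record`, the bucket paths, and the prompt wording are all hypothetical.

```python
import json

def make_record(image1_uri, image2_uri, prompt, label, mime_type="image/jpeg"):
    """Build one training example with two image parts and one text part.

    Assumption: the tuning dataset accepts several fileData parts per turn.
    """
    def image_part(uri):
        return {"fileData": {"mimeType": mime_type, "fileUri": uri}}

    return {
        "contents": [
            {
                "role": "user",
                "parts": [image_part(image1_uri), image_part(image2_uri), {"text": prompt}],
            },
            {"role": "model", "parts": [{"text": label}]},
        ]
    }

# Hypothetical URIs, for illustration only.
record = make_record(
    "gs://my-bucket/pairs/0001_a.jpeg",
    "gs://my-bucket/pairs/0001_b.jpeg",
    "Is image 1 found in image 2? Answer Found or Not found.",
    "Not found",
)
jsonl_line = json.dumps(record)  # one example per line in the JSONL file
print(jsonl_line)
```

The order of the two image parts matters here, because the text part refers to them as "image 1" and "image 2".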

Moving to Google AI Studio, I can use 2 images and text as the input, but only by entering each example manually. That's not practical for up to 500 records.
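
For a few hundred records, the dataset could be generated from a manifest rather than assembled by hand. This is a minimal sketch, assuming a hypothetical CSV with columns `image1`, `image2`, and `label` that hold Cloud Storage URIs and the expected answer. The column names, the prompt text, and the two-image record shape are assumptions, not something the documentation confirms.

```python
import csv
import io
import json

PROMPT = "Is image 1 found in image 2? Answer Found or Not found."  # hypothetical wording

def manifest_to_jsonl(manifest_file, mime_type="image/jpeg"):
    """Turn CSV rows (image1, image2, label) into JSONL training lines."""
    lines = []
    for row in csv.DictReader(manifest_file):
        # Two image parts first, then the text instruction.
        user_parts = [
            {"fileData": {"mimeType": mime_type, "fileUri": row[col]}}
            for col in ("image1", "image2")
        ]
        user_parts.append({"text": PROMPT})
        record = {
            "contents": [
                {"role": "user", "parts": user_parts},
                {"role": "model", "parts": [{"text": row["label"]}]},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"

# In-memory manifest standing in for a real file; URIs are placeholders.
manifest = io.StringIO(
    "image1,image2,label\n"
    "gs://my-bucket/a1.jpeg,gs://my-bucket/b1.jpeg,Found\n"
    "gs://my-bucket/a2.jpeg,gs://my-bucket/b2.jpeg,Not found\n"
)
jsonl_text = manifest_to_jsonl(manifest)
print(jsonl_text)
```

With a real manifest, the same function could be called on an opened file (`open("pairs.csv", newline="")`) and the result written to a `.jsonl` file for upload.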

AI Studio can then generate example code for this training dataset, but it uses the Gemini API library rather than a Vertex AI one:

"""
Install the Google AI Python SDK

$ pip install google-generativeai
"""

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def upload_to_gemini(path, mime_type=None):
  """Uploads the given file to Gemini.

  See https://ai.google.dev/gemini-api/docs/prompting_with_media
  """
  file = genai.upload_file(path, mime_type=mime_type)
  print(f"Uploaded file '{file.display_name}' as: {file.uri}")
  return file

# Create the model
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-pro",
  generation_config=generation_config,
  # safety_settings = Adjust safety settings
  # See https://ai.google.dev/gemini-api/docs/safety-settings
)

# TODO Make these files available on the local file system
# You may need to update the file paths
files = [
  upload_to_gemini("image_architecture2.jpeg", mime_type="image/jpeg"),
  upload_to_gemini("image_animal1.jpeg", mime_type="image/jpeg"),
  upload_to_gemini("image_architecture2.jpeg", mime_type="image/jpeg"),
  upload_to_gemini("image_objects2.jpeg", mime_type="image/jpeg"),
  upload_to_gemini("image_food4.jpeg", mime_type="image/jpeg"),
  upload_to_gemini("image_space3.jpeg", mime_type="image/jpeg"),
  upload_to_gemini("image_transportation3.jpeg", mime_type="image/jpeg"),
]

response = model.generate_content([
  "Tasks:1. Analyze image 1 and image 2. Return Found if image 2 is in image 1, and Not Found if not.",
  "input: Image 1: \n",
  files[0],
  "\nImage 2:\n",
  files[1],
  "output: Not found",
  "input: Image 1:\n",
  files[2],
  "\nImage 2:\n",
  files[3],
  "output: Not found",
  "input: Image 1:\n",
  files[4],
  "\nImage 2:\n",
  files[5],
  "output: Found",
  "input: Image 1:\n",
  files[6],
  # NOTE: no second image is supplied for this final query
  "\nImage 2:",
  "output: ",
])

print(response.text)
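
The interleaved few-shot list above could also be assembled from data instead of being typed out. This is a minimal sketch: `build_prompt` is a hypothetical helper, and the image arguments stand for whatever `upload_to_gemini` returns (plain strings are used here so the snippet runs without an API key).

```python
def build_prompt(task, examples, query):
    """Interleave text and image parts in the same order as the call above.

    examples: list of (image1, image2, label) tuples
    query:    (image1, image2) pair the model should classify
    """
    parts = [task]
    for image1, image2, label in examples:
        parts += ["input: Image 1:\n", image1, "\nImage 2:\n", image2, f"output: {label}"]
    query1, query2 = query
    # End on an empty "output: " so the model completes the answer.
    parts += ["input: Image 1:\n", query1, "\nImage 2:\n", query2, "output: "]
    return parts

# Placeholder strings stand in for uploaded file handles.
parts = build_prompt(
    "Return Found if image 2 is in image 1, and Not Found if not.",
    [("<file A>", "<file B>", "Not found"), ("<file C>", "<file D>", "Found")],
    ("<file E>", "<file F>"),
)
print(len(parts))
```

With real file handles in place of the placeholder strings, the returned list could be passed to `model.generate_content(parts)`.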

Is it possible to achieve this fine-tuning with 2 images and text as input on Vertex AI? There is limited documentation about using 2 images and text as input.

Any direction or guidance appreciated.

Tom.
