Hi there,
I'm trying to create a fine-tuned Gemini model that takes 2 images and text as input and produces text as output. Essentially object detection, e.g. 'is image 1 found in image 2?'
According to documentation within 'Generative AI on Vertex AI > Documentation'...
"Visual inspection: Train a model to identify specific objects or defects within images, automating quality control or inspection processes."
- https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune_gemini/image_tune
And the example training dataset only provides one image within parts for the user role:
{ "contents": [ { "role": "user", "parts": [ { "fileData": { "mimeType": "image/jpeg", "fileUri": "gs://cloud-samples-data/ai-platform/generative_ai/gemini-1_5/image/longcap100/100.jpeg" } }, { "text": "Describe this image in detail that captures the essence of it." } ] }, { "role": "model", "parts": [ { "text": "A man stands on a road, wearing a blue denim jacket, tan pants, and white sneakers. He has his hands in his pockets and is wearing a white t-shirt under his jacket. The man's pants are cuffed, and his shoes are white. The road is dark grey, and the leaves are green. The man is standing in the shade, and the light is shining on the ground." } ] } ] }
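For comparison, here is a minimal sketch (my own, not from the docs) of how the same record structure could carry two images in the user turn; the GCS URIs and prompt text below are placeholders I made up:

```python
import json

# Hypothetical GCS URIs -- replace with your own objects.
record = {
    "contents": [
        {
            "role": "user",
            "parts": [
                {"fileData": {"mimeType": "image/jpeg",
                              "fileUri": "gs://my-bucket/query.jpeg"}},
                {"text": "Image 1:"},
                {"fileData": {"mimeType": "image/jpeg",
                              "fileUri": "gs://my-bucket/scene.jpeg"}},
                {"text": "Image 2:"},
                {"text": "Is image 1 found in image 2? Answer Found or Not Found."},
            ],
        },
        {"role": "model", "parts": [{"text": "Found"}]},
    ]
}

# JSONL requires exactly one record per line, so dump without indentation.
line = json.dumps(record)
print(line)
```

Whether the Vertex AI tuning service accepts two fileData parts per user turn is exactly the open question in this thread; the sketch only shows that the record format itself allows it.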
"""
Install the Google AI Python SDK

$ pip install google-generativeai
"""
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])


def upload_to_gemini(path, mime_type=None):
    """Uploads the given file to Gemini.

    See https://ai.google.dev/gemini-api/docs/prompting_with_media
    """
    file = genai.upload_file(path, mime_type=mime_type)
    print(f"Uploaded file '{file.display_name}' as: {file.uri}")
    return file


# Create the model
generation_config = {
    "temperature": 1,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 8192,
    "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    generation_config=generation_config,
    # safety_settings = Adjust safety settings
    # See https://ai.google.dev/gemini-api/docs/safety-settings
)

# TODO: Make these files available on the local file system.
# You may need to update the file paths.
files = [
    upload_to_gemini("image_architecture2.jpeg", mime_type="image/jpeg"),
    upload_to_gemini("image_animal1.jpeg", mime_type="image/jpeg"),
    upload_to_gemini("image_architecture2.jpeg", mime_type="image/jpeg"),
    upload_to_gemini("image_objects2.jpeg", mime_type="image/jpeg"),
    upload_to_gemini("image_food4.jpeg", mime_type="image/jpeg"),
    upload_to_gemini("image_space3.jpeg", mime_type="image/jpeg"),
    upload_to_gemini("image_transportation3.jpeg", mime_type="image/jpeg"),
]

response = model.generate_content([
    "Tasks: 1. Analyze image 1 and image 2. Return Found if image 2 is in image 1, and Not Found if not.",
    "input: Image 1:\n",
    files[0],
    "\nImage 2:\n",
    files[1],
    "output: Not found",
    "input: Image 1:\n",
    files[2],
    "\nImage 2:\n",
    files[3],
    "output: Not found",
    "input: Image 1:\n",
    files[4],
    "\nImage 2:\n",
    files[5],
    "output: Found",
    "input: Image 1:\n",
    files[6],
    "\nImage 2:",
    # NOTE: the second image for this final (query) example is missing here.
    "output: ",
])
print(response.text)
Is it possible to achieve this fine-tuning with 2 images and text as input on Vertex AI? There is limited documentation about using 2 images and text as input.
Any direction or guidance appreciated.
Tom.
Hi @TR-WW,
Welcome to Google Cloud Community!
You can add images to Gemini requests to perform tasks that involve understanding the contents of the included images. This page shows you how to add images to your requests to Gemini in Vertex AI by using the Google Cloud console and the Vertex AI API.
I hope the above information is helpful.
Hi Ruth, what about the JSONL formatting for validation testing? Do we omit the "role" for the model and just provide the two images as input?
Are you using two images in your user role?
Hey TR-WW, yes, I need to input two images into the user role. Here is an example I have prepared, but there's still an issue: it's still reporting a missing 'contents' field. Might it need apostrophes instead of quotes?
{"contents": [{"role": "user", "parts": [{"fileData": {"mimeType": "image/png", "fileUri": "gs://human_rated_tactile_map_similarity_scores/NEINV002_220318_162612_DrawFullMap_01_resized.png"}}, {"text": "Image 1 (hand drawing)"}, {"fileData": {"mimeType": "image/png", "fileUri": "gs://human_rated_tactile_map_similarity_scores/Map01_solution0.png"}}, {"text": "Image 2 (template)"}, {"text": "Compare these two images and provide a similarity score from 0 to 10, where 0 is completely different and 10 means all local features are in the right locations relative to each other. Only respond with 'Similarity Score:' followed by the numeric score."}]}, {"role": "model", "parts": [{"text": "Similarity Score: 3"}]}]}
{"contents": [{"role": "user", "parts": [{"fileData": {"mimeType": "image/png", "fileUri": "gs://human_rated_tactile_map_similarity_scores/NEINV002_220318_162612_DrawFullMap_02_resized.png"}}, {"text": "Image 1 (hand drawing)"}, {"fileData": {"mimeType": "image/png", "fileUri": "gs://human_rated_tactile_map_similarity_scores/Map01_solution0.png"}}, {"text": "Image 2 (template)"}, {"text": "Compare these two images and provide a similarity score from 0 to 10, where 0 is completely different and 10 means all local features are in the right locations relative to each other. Only respond with 'Similarity Score:' followed by the numeric score."}]}, {"role": "model", "parts": [{"text": "Similarity Score: 4"}]}]}
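Since the service is complaining about a missing 'contents' field, a quick stdlib check can rule out the usual culprits (a line that isn't valid JSON, a BOM at the start of the file, a misspelled key). This is my own sketch, not an official validator:

```python
import json


def check_jsonl(path):
    """Report any line that is not a JSON object with a 'contents' key."""
    problems = []
    # utf-8-sig strips a byte-order mark if one is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        for i, raw in enumerate(f, start=1):
            line = raw.strip()
            if not line:
                continue  # ignore blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"invalid JSON: {e}"))
                continue
            if "contents" not in obj:
                problems.append((i, "missing 'contents' field"))
    return problems
```

Running this over the dataset before uploading it should pinpoint which line (if any) the tuning service is rejecting.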
Hi @ruthseki,
I appreciate your response; however, the information you're referring to describes standard use of the API.
I'm referring to fine-tuning a Gemini model, and my issue revolves around using 2 images as the input in the training data. This seems possible in Google AI Studio (manually), but not via Vertex AI when passing a JSONL training dataset.
Please kindly review my post and confirm.
Hi, Tom,
I was able to get my two image inputs and text prompt working in fine-tuning for Gemini-1.5-Pro-002; however, I was on the free trial and it burned through the entire free $300. I let it run over the weekend, so I do not know the result of it...
I will need to figure out a way to do this without being billed. Please let me know if you have any idea. I will provide an example of the working JSONL structure below:
Here is what the JSONL structure for the working tuning looked like for my dataset; make sure the end-of-line sequence is 'LF' and not 'CRLF':
{"contents": [{"role": "user", "parts": [{"fileData": {"mimeType": "image/png", "fileUri": "gs://human_rated_tactile_map_similarity_scores/NEINV002_220318_162612_DrawFullMap_01_resized.png"}}, {"text": "Image 1 (hand drawing)"}, {"fileData": {"mimeType": "image/png", "fileUri": "gs://human_rated_tactile_map_similarity_scores/Map01_solution0.png"}}, {"text": "Image 2 (template)"}, {"text": "Compare these two images and provide a similarity score from 0 to 10, where 0 is completely different and 10 means all local features are in the right locations relative to each other. Only respond with 'Similarity Score:' followed by the numeric score."}]}, {"role": "model", "parts": [{"text": "Similarity Score: 3"}]}]}
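To guarantee LF endings regardless of platform (the CRLF issue mentioned above), you can pass newline="\n" when writing the file, so Python doesn't translate "\n" into "\r\n" on Windows. A small sketch, assuming you already have your records as dicts:

```python
import json


def write_jsonl(records, path):
    """Write records as JSONL with LF line endings, even on Windows."""
    # newline="\n" disables Python's platform newline translation.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

Generating the file this way avoids having to fix line endings in an editor afterwards.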
Some images of the tuning going through before I left for the weekend:
Interesting. I'm surprised this isn't an example in the documentation.
Thanks for confirming it works. I will check it out and let you know if I find anything.
Also, I'm not clear on the billing: are you seeing billing reports that reflect the costs of fine-tuning?
Yeah, I pretty much saved the document that @ruthseki provided as a PDF and tossed that into Claude to help me edit the JSONL structure to include the second image in the user role.
I was using Vertex AI, where I wrote a Python script to upload the images in my .mat file to GCS, since I had around 1000 images and couldn't be bothered to upload them manually. Before I began the tuning, I know I had ~$180 in free credits left. It used all of it, and I'm not even sure where it stopped in the tuning progress, so I honestly have no clue how much it would even cost. Below is an image of my billing overview:
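For a bulk upload like that, a loop over the google-cloud-storage client is enough. This is a hedged sketch, not my actual script: the bucket name, the "training_images/" prefix, and the .png-only filter are assumptions, and the client import sits inside the upload function so the path-mapping helper works without the library installed:

```python
from pathlib import Path


def blob_name_for(local_path, prefix="training_images"):
    """Map a local file path to a destination object name in the bucket."""
    return f"{prefix}/{Path(local_path).name}"


def upload_folder(folder, bucket_name):
    """Upload every .png in `folder` to gs://<bucket_name>/<prefix>/.

    Requires `pip install google-cloud-storage` and application-default
    credentials (e.g. `gcloud auth application-default login`).
    """
    from google.cloud import storage  # optional dependency, imported lazily
    bucket = storage.Client().bucket(bucket_name)
    for path in sorted(Path(folder).glob("*.png")):
        blob = bucket.blob(blob_name_for(path))
        blob.upload_from_filename(str(path))
        print(f"Uploaded gs://{bucket_name}/{blob.name}")
```

The resulting gs:// URIs are what go into the fileUri fields of the JSONL dataset.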
I just wrote my own Python script to use the free Gemini API key, and it seems to be doing well. I was able to tune the 1.5-Flash-001 model. I think it's the only one we can tune with the Gemini API key? I tried many other models, but it always gave an error: