Question on Finetuning: Can you hardcode images or upload image responses via the image_url subkey of content?

stephtierney88 · August 27, 2024, 6:21am

something like

Example with an Image URL:

json

{
  "messages": [
    {"role": "system", "content": "You are an assistant helping a user navigate a game."},
    {"role": "user", "content": "Where should I go next?"},
    {"role": "image", "image_url": "https://example.com/path/to/image.jpg"},
    {"role": "assistant", "content": "Move to the right to avoid the obstacle and proceed to the next level."}
  ]
}

Example with Base64 Image Data:

json

{
  "messages": [
    {"role": "system", "content": "You are an assistant helping a user navigate a game."},
    {"role": "user", "content": "Where should I go next?"},
    {"role": "image", "content": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE... (rest of base64 string)"},
    {"role": "assistant", "content": "Move to the right to avoid the obstacle and proceed to the next level."}
  ]
}

{
  "messages": [
    {
      "role": "user",
      "content": "type:image_url lEQVR4nOydeVxN2xfAV7d5nkeao2iS...."
    },
    {
      "role": "assistant",
      "content": "type:text pyag: move(1100, 500); # Notes: The cursor is now positioned over a highlighted section in the game log interface.",
      "weight": 1
    }
  ]
}

For example, when the character is facing a wall in the game, I should probably give some example responses with the commands the screenshot-agent parser would use to navigate away from the wall, and use the weightings to demote walking into the wall, especially if the character is already facing it. Or, to help the model understand the player character, could I gradually use weighting, correct alternatives, and post-processing during fine-tuning to guide it toward better performance (and hopefully help the model ‘see’ the center character tile better and navigate more smoothly)?

jr.2509 · August 27, 2024, 6:24am

Hi there! Fine-tuning with images is currently not supported unfortunately.

stephtierney88 · August 27, 2024, 6:30am

I appreciate the reply, thank you. I couldn’t find anywhere in the docs that explicitly said that.

nphat44444 · August 27, 2024, 7:09am

Currently, the fine-tuning process for OpenAI models, including GPT-4 and GPT-3.5, primarily supports text-based inputs. This means that the standard fine-tuning process is designed to work with datasets composed of text data. The inclusion of images directly into the JSON structure for fine-tuning, especially in Base64 encoded format within the same JSON as text data, is not directly supported in the manner you’ve described.

However, there are a few approaches you might consider to work with multi-modal (text and image) data:

Separate Processing: One approach could involve processing the image data separately to extract relevant text or features that can be described in text form. This text representation of the image data could then be included in the fine-tuning dataset alongside the other text inputs. This method requires an additional step of image processing and analysis before fine-tuning.
Use of Descriptions: Instead of embedding the images directly, you could use detailed descriptions of the images as part of your training data. This would allow the model to learn from the descriptions, which could be a workaround for integrating the essence of the image data into the fine-tuning process.
Exploring GPT-4’s Multimodal Capabilities: While the fine-tuning process itself may not support direct integration of Base64 encoded images within the JSON structure, GPT-4 offers multimodal capabilities that allow it to process both text and image inputs. You might explore how these capabilities can be leveraged in your application, although it’s important to note that this would be more about using the model post-fine-tuning rather than integrating images into the fine-tuning process itself.
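
As a rough illustration of the "Use of Descriptions" workaround, the sketch below writes a text-only training record in the chat JSONL layout used earlier in this thread. The helper names (build_text_example, to_jsonl) and the sample description are hypothetical, and how the description gets produced (by hand or by a separate image-analysis step) is left open.

```python
import json


def build_text_example(description: str, question: str, answer: str) -> dict:
    """Build one text-only chat record; the screenshot is replaced by a description."""
    return {
        "messages": [
            {"role": "system", "content": "You are an assistant helping a user navigate a game."},
            # The image itself is not embedded; only its description travels as plain text.
            {"role": "user", "content": f"Screenshot description: {description}\n{question}"},
            {"role": "assistant", "content": answer},
        ]
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical sample data, for illustration only.
example = build_text_example(
    description="The character is in the center tile, facing a wall directly above.",
    question="Where should I go next?",
    answer="Move to the right to avoid the obstacle and proceed to the next level.",
)
jsonl_text = to_jsonl([example])
```

Each record stays plain text, so it fits the text-only dataset format described above; the quality of the result would depend on how faithfully the description captures the screenshot.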

Considerations and Limitations:

Data Representation: Converting images to a text-based representation or description requires careful consideration to ensure that the essence of the image is captured accurately.
Model Capabilities: The effectiveness of integrating image data through descriptions or extracted text will depend on the model’s ability to understand and generate responses based on these representations.

Guidance:

For integrating image data into your project, you might consider using the Vision capabilities of GPT-4 for processing images and generating text-based responses. This could be a separate step from fine-tuning but can enrich the model’s responses based on image inputs.
Review the Fine-tuning guide for detailed instructions and best practices on preparing your dataset and executing the fine-tuning process.
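
For the separate inference step mentioned above, a minimal sketch of attaching an image to a user message as an image_url content part (the shape the Chat Completions API accepts for image input) might look like the following. It only builds the messages payload and makes no network call; the helper names and the placeholder bytes are assumptions for illustration.

```python
import base64


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_messages(question: str, image_url: str) -> list[dict]:
    """Build a messages list whose user turn carries both text and an image part."""
    return [
        {"role": "system", "content": "You are an assistant helping a user navigate a game."},
        {
            "role": "user",
            # Content is a list of typed parts rather than a single string.
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]


# Placeholder bytes standing in for a real screenshot (assumption, not a valid JPEG).
fake_image = b"\xff\xd8\xff\xe0 placeholder, not a real JPEG"
data_url = image_to_data_url(fake_image)
messages = build_vision_messages("Where should I go next?", data_url)
```

Note that the image sits inside the user message's content list; there is no separate "image" role. An ordinary https URL can be passed in place of the data URL.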
