RL Finetuning with Text + Image Data - Image Data not Supported

Reinforcement Fine-Tuning with Images - Getting “method does not support it” Error

Problem Description

I’m trying to use OpenAI’s reinforcement fine-tuning API with multimodal data (text + images), but the job fails with an error that seems to contradict the API documentation.

Error Message

The job failed due to a file format error in the training file. 
Invalid file format. Input file <file id> contains images, 
but the method `reinforement` does not support it.

Documentation Confusion

The OpenAI reinforcement fine-tuning documentation states:

“Input messages may contain text or image content only. Audio and file input messages are not currently supported for fine-tuning.”

This suggests that images should be supported, but the error message indicates otherwise.

My Data Format

I’m formatting my training data in JSONL with the following structure:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "prompt..."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://..."
          }
        },
        ...
      ]
    }
  ]
}
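
To make the structure above easier to reproduce, here is a minimal Python sketch that builds one record in this shape and scans a JSONL file's text for lines containing image parts. The helper names (build_example, count_image_parts, scan_jsonl) and the example URL are hypothetical, chosen only for illustration. The script does not call the OpenAI API and says nothing about whether the reinforcement method accepts images; it only shows which lines would trigger the "contains images" check.

```python
import json


def build_example(prompt: str, image_url: str) -> dict:
    """Build one training record with a text part and an image_url part."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ]
    }


def count_image_parts(record: dict) -> int:
    """Count content parts of type "image_url" across all messages."""
    count = 0
    for message in record.get("messages", []):
        content = message.get("content")
        # Content may be a plain string (text only) or a list of typed parts.
        if isinstance(content, list):
            count += sum(
                1
                for part in content
                if isinstance(part, dict) and part.get("type") == "image_url"
            )
    return count


def scan_jsonl(text: str) -> list[tuple[int, int]]:
    """Return (line_number, image_count) for every line that contains images."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        image_count = count_image_parts(json.loads(line))
        if image_count:
            hits.append((line_number, image_count))
    return hits


if __name__ == "__main__":
    # Placeholder URL, not taken from my real data.
    record = build_example("prompt...", "https://example.com/image.png")
    jsonl_text = json.dumps(record) + "\n"
    print(scan_jsonl(jsonl_text))
```

Running this against the real training file (read its contents and pass them to scan_jsonl) lists the lines that include images, which may help when testing whether a text-only subset passes validation.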

Questions

  1. Does reinforcement fine-tuning actually support images? The documentation suggests yes, but the error suggests no.

  2. Is the image format different for reinforcement fine-tuning vs. regular API calls? I’ve structured my messages to match regular API calls, but I’m not sure whether fine-tuning requires a different format.

Has anyone successfully used images with reinforcement fine-tuning? Or is this a known limitation that’s not clearly documented?

Any guidance or workarounds would be greatly appreciated. If images aren’t supported, it would be helpful if the documentation could be updated to clarify this limitation.

Thank you!