Fine Tuning for Vision models

I am trying to fine tune a gpt4o based model with Vision. I prepared my images (around 1000, 1.5GB) and I updated the to the File Storage. I then built my example messages. I am getting an error message when uploading my .jsonl file:

The job failed due to an invalid training file. Invalid file format. Line 1, message 3: Input tag ‘image_file’ found using ‘type’ does not match any of the expected tags: ‘image_url’, ‘text’

Can the model no be fine-tuned using images from the file store? base64 encoding is not really suitable for my many images use-case and making them available to public urls is also a bit complicated as of now.

Uploaded image files in storage by ID is only a vision option for the Assistants API.

Fabricating your own type of JSON will not work.

https://platform.openai.com/docs/guides/fine-tuning#vision

Images can be provided either as HTTP URLs or data URLs containing base64 encoded images.

You can prepare your images by downsizing them so the shorter dimension is maximum 768 pixels, or for detail:low, downsize so the largest dimension is 512 pixels or below, along with the optimum compressed file format for the type for the sake of transmission (or you might use double-size jpg if pixel-level color is important). This is the same as is done server-side when sending at inference time, and will make your file upload smaller.

Upload them to a server accessible by an district url for each image, and update your training file with image_url