Fine Tuning for Vision models

Uploaded image files in storage by ID is only a vision option for the Assistants API.

Fabricating your own type of JSON will not work.

https://platform.openai.com/docs/guides/fine-tuning#vision

Images can be provided either as HTTP URLs or data URLs containing base64 encoded images.

You can prepare your images by downsizing them so the shorter dimension is maximum 768 pixels, or for detail:low, downsize so the largest dimension is 512 pixels or below, along with the optimum compressed file format for the type for the sake of transmission (or you might use double-size jpg if pixel-level color is important). This is the same as is done server-side when sending at inference time, and will make your file upload smaller.