Why does my vision fine-tuning job keep failing?

Hello,

I am trying to fine-tune the model gpt-4.1-2025-04-14 (I tried also 4o model) using vision fine-tuning, following the official instructions provided here:
https://platform.openai.com/docs/guides/vision-fine-tuning

Unfortunately, the fine-tuning job consistently fails with the following error message:

“The job failed due to a file format error in the training file. Error processing images. Please re-upload the file and try again.”

The main issue is that this error message is not specific enough to indicate what exactly needs to be fixed. I have strictly followed the JSONL structure described in the documentation, but the job still fails.

I tried multiple approaches to rule out data issues:

  • providing images via public URL links hosted on cloud storage,

  • embedding images directly in the JSONL file using base64 encoding,

  • and even using the officially published example from the documentation (“cookbook data”, specifically the cheese assistant example - I duplicated the example 10 times in JSONL format in order to meet the minimum requirement of 10 training examples. ).

All of these approaches result in the same error message.

Could you please advise whether there is a known issue with vision fine-tuning for this model, or provide more detailed information about what conditions can trigger this error? Any guidance on how to successfully run a vision fine-tuning job would be greatly appreciated.

Finally I am goint to show you an example of my data:
{“messages”: [{“role”: “system”, “content”: “You are a structured data extraction engine.You must strictly follow the output schema described by the user.You must never invent additional top-level objects or arrays.You must never split one logical record into multiple JSON objects.If the table contains multiple rows, they must be represented ONLY inside arrays, never as separate top-level objects.Output must be valid JSON only.No markdown. No explanations.Accuracy is more important than consistency.Words must be transcribed exactly as seen, character by character, without spelling correction or normalization. Do not correct malformed wordsEach table cell must be interpreted independently. Never assume that text in neighboring rows or columns is identical.Handwritten digits may be ambiguous. A single vertical stroke ‘|’ inside a numeric cell often represents the handwritten digit ‘1’. Do not treat such marks as empty cells by default.”}, {“role”: “user”, “content”: “Extract the table into JSON.You MUST return EXACTLY ONE JSON OBJECT (not an array).Top-level JSON structure (exactly this):{"Location": string, "Weather": string, "Date": "DD-MM-YYYY", "Time": string, "Obs": string, "species": [{"Name": string, "Number": string, "Remarks": string}]}IMPORTANT STRUCTURE RULES:- There must be exactly ONE top-level JSON object.- The top-level fields Location, Weather, Date, Time, and Obs appear ONLY ONCE.- All species rows MUST be placed inside the "species" array.- Do NOT create multiple JSON objects for individual species rows.- Do NOT repeat Location, Weather, Date, Time, or Obs inside species entries.TABLE INTERPRETATION RULES:- Each visual species row corresponds to ONE object inside the "species" array.- Even if header cells appear blank in the table, copy their values from the table header if present.- If a cell is empty, return an empty string "".- Do not duplicate values into other rowsCOLUMN RULES:- Name: copy verbatim from the Species column.- Number: copy verbatim from the Nos. column (including symbols like +, c., .). If a cell contains a single vertical line "|" interpret it as the digit "1".- Remarks: copy verbatim.- Weather: weather information for that day.- Date: return in DD-MM-YYYY format.- Time: return as HHMM-HHMM or HHMM if only one time is present.VALUES:- Transcribe only what is explicitly visible in each cell.- For textual fields, preserve exact characters even if the result is not a valid word.- Never infer missing values.- Never copy or propagate values from other rows or columns.- Do NOT reuse text from another row even if it appears similar - write exactly what is seen in each cell.- Output ONLY valid JSON. No markdown. No commentary.”}, {“role”: “user”, “content”: [{“type”: “image_url”, “image_url”: {“url”: “https://pub-43104181f33740e592bf40bde14d7415.r2.dev/batch_6/page_5.jpg”}}]}, {“role”: “assistant”, “content”: “{"Location":"Ghadira Nature Reserve","Weather":"Sunny. Wind: med. NW’ly","Date":"05-06-2006","Time":"0530-1430","Obs":"CG, AC, RV","species":[{"Name":"Mallard","Number":"3","Remarks":"."},{"Name":"Little Egret","Number":"1","Remarks":""},{"Name":"Moorhen","Number":"25+","Remarks":""},{"Name":"Little Ringed Plover","Number":"11","Remarks":"Mixed pair copulating on island near dune."},{"Name":"Little Stint","Number":"c.26","Remarks":""},{"Name":"Swallow","Number":"2","Remarks":""},{"Name":"Red-rumped Swallow","Number":"1","Remarks":""},{"Name":"Yellow-legged Gull","Number":"5","Remarks":""},{"Name":"Cetti’s Warbler","Number":"10+","Remarks":""},{"Name":"Zitting Cisticola","Number":"10+","Remarks":""},{"Name":"Reed Warbler","Number":"1","Remarks":""},{"Name":"Sardinian Warbler","Number":"15+","Remarks":""},{"Name":"Spanish Sparrow","Number":"150+","Remarks":""},{"Name":"Greenfinch","Number":"2","Remarks":"(1♂)"},{"Name":"Swift","Number":"25+","Remarks":""},{"Name":"Garden Warbler","Number":"1","Remarks":"heard singing"}]}”}]}

Thank you in advance.

You discover a problem that persists.

I’ll see if a mod can pass along to fix the broken fine-tuning endpoint so that images will work.

2 Likes