I am fine-tuning gpt-4.1-mini and expected to have roughly one million tokens of context available, but the job failed on my training data with an invalid-format error, likely because I exceeded the token length. When I checked my data with tiktoken it was under 90,000 tokens (but indeed over 65,536). Attached is a screenshot, and here is the error text. I am wondering why I am not getting the token length I expected.
“The job failed due to a file format error in the validation file. Invalid file format. Example 1138 No completion or assistant tokens were found in the dataset (possibly because of truncation). It’s likely that all assistant tokens are outside of the context window (65536 tokens). Please check your dataset or use a model with a larger context window.”
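For reference, this is roughly how I counted tokens per example (a sketch: I'm assuming the o200k_base encoding that gpt-4o/gpt-4.1 use, and ignoring the few formatting tokens the chat format adds around each message):

```python
# Rough per-example token count for a chat fine-tuning JSONL file.
import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for gpt-4.1 models

with open("train.jsonl") as f:
    for i, line in enumerate(f, 1):
        messages = json.loads(line)["messages"]
        n = sum(len(enc.encode(m["content"])) for m in messages)
        if n > 65536:
            print(f"Example {i}: {n} tokens (over the 65536 training window)")
```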
I used ChatGPT’s agent mode to track down any reference to this limitation. The output was full of hallucinations, citing links whose destination pages contained no such reference, yet reporting the same behavior as if it had been found. By replaying the agent mode “video”, reading the limited discussion within it, and doing some more Googling, I found this within Microsoft Azure (which has much better API documentation in general):
It states, even for gpt-4o (4o rather than 4.1), that training examples are truncated at a length below the model’s context window. I expect the same technology transfers between OpenAI models and back to OpenAI’s own platform.
The limit sitting 1000 tokens below 2**16 is likely an allowance for OpenAI’s own injection of an initial “safety message” containing anything they want. The gpt-4.1 model itself has a context window of 1,047,576, likewise 1000 tokens short of 2**20 (the marketed “one million”), likely the same reservation for their authority and control over your inference, with undocumented messages being placed.
You should think of fine-tuning as training on responses, that is, on what the AI generates. Fine-tuning is not good at imparting knowledge, and if you try, the training examples should depict an assistant that magically already knows everything in its replies. Don’t rely on large inputs alone, assuming that by itself will give you a knowledgeable AI.
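For illustration, here is a minimal sketch of one line of a chat fine-tuning JSONL file (the content is invented). The assistant message is what the loss is computed on, which is why the job complains when those tokens fall outside the window:

```python
# One chat fine-tuning example, serialized as a single JSONL line.
# Only the assistant message is the training target; the system and
# user turns are the context the model learns to respond to.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a terse support bot."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Settings > Security > Reset password."},
    ]
}
print(json.dumps(example))  # append this line to train.jsonl
```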
If you are placing large retrieval results in a turn and attempting to train behaviors, I would simply pare that injected turn down: chunk it, run embeddings against the user input and task context, and drop the lowest-scoring chunks until the RAG budget, and the example total, is under the supervised training limit, as in the sketch below.
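Something like this (a sketch, not an official API: `prune_retrieval`, the choice of text-embedding-3-small, and the budget handling are all my own illustration):

```python
# Sketch: prune retrieved chunks by embedding similarity until the
# training example fits under the fine-tuning context window.
import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # assumed gpt-4.1 encoding
TRAIN_LIMIT = 65536  # the training window from the error message

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def prune_retrieval(chunks: list[str], user_query: str, budget: int) -> list[str]:
    """Keep the chunks most similar to the query until they fit the token budget."""
    vecs = embed(chunks + [user_query])
    chunk_vecs, query_vec = vecs[:-1], vecs[-1]
    # Cosine similarity; normalize defensively even though these
    # embeddings come back close to unit length.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    kept, used = [], 0
    for idx in np.argsort(-sims):  # best-matching chunks first
        n = len(enc.encode(chunks[idx]))
        if used + n <= budget:
            kept.append((int(idx), chunks[idx]))
            used += n
    kept.sort()  # restore original document order
    return [text for _, text in kept]
```

Set `budget` to TRAIN_LIMIT minus whatever your system, user, and assistant turns already consume, so the assistant tokens always land inside the window.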