Fine-tuned gpt-4o model has lower validation loss, but completely fails basic formatting requirements

I’ve been working on a fairly complex emotive categorization task on text, where categorizations and alternatives for specific phrases are returned in JSON format. I’ve made many fine-tunes with gpt-4o-mini and gpt-3.5-turbo, and there have still been accuracy issues, so before doubling my training set again, I figured I’d see what happens if I spend the extra money to try fine-tuning gpt-4o.
At first, the graph looked pretty promising compared with the performance from gpt-4o-mini. The gpt-4o version had lower validation loss and didn’t seem to overfit as quickly:

Compared with the performance from gpt-4o-mini, which had a higher minimum validation loss before seeming to overfit on the third epoch:

However, when I actually got around to testing the gpt-4o model, it completely failed to use the basic JSON format I wanted, something I could get gpt-4o-mini and gpt-3.5-turbo to follow with a training set of 15 examples or fewer. Even with the explanation in the system prompt, and training for 3 epochs on 80 examples of proper formatting, this gpt-4o model decided on a formatting method of its own: proper JSON, but not what I want it to do at all, and not something a single training example ever demonstrated even once. It’s basically completely useless for my purposes.
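To illustrate the shape I mean (the phrase, categories, and alternatives below are invented for illustration, not my actual data), each line of the JSONL training file is a standard chat-format example whose assistant message is the JSON I expect back (shown wrapped here for readability; it’s a single line in the real file):

```json
{"messages": [
  {"role": "system", "content": "Categorize the emotive phrases in the text. Reply only with JSON."},
  {"role": "user", "content": "I was thrilled, honestly, though a bit nervous too."},
  {"role": "assistant", "content": "{\"phrases\": [{\"text\": \"thrilled\", \"category\": \"joy\", \"alternatives\": [\"excitement\"]}, {\"text\": \"a bit nervous\", \"category\": \"fear\", \"alternatives\": [\"anxiety\"]}]}"}
]}
```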

One thing I’m considering is that there could have just been some fluke in how this model was trained. During training, it gave a strange error, “The job experienced an error while training and failed, it has been re-enqueued for retry”, which I’d never seen before:

I don’t really want to spend another $10 and just “try again” based on this theory, though.

Is gpt-4o known to have problems with JSON formatting? I know since it’s a larger model it can be harder to train, so you might need a larger training set or more epochs. Could this problem just be a result of that? It just seems too weird because it’s such a fundamental formatting problem.

No.


What exactly are you trying to do here?

You should be “prompt-jamming” the model first with examples before running fine-tunes to see how it reacts.
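For instance, something like this, before spending anything on training. Everything here (system prompt, few-shot examples, expected schema) is a placeholder to substitute with your own:

```python
import json

# Placeholder system prompt and few-shot examples -- substitute your real ones.
SYSTEM_PROMPT = 'Categorize emotive phrases. Reply only with JSON of the form {"phrases": [...]}.'

FEW_SHOT = [
    ("I was thrilled to see her.",
     '{"phrases": [{"text": "thrilled", "category": "joy", "alternatives": ["excitement"]}]}'),
]

def build_messages(user_text: str) -> list[dict]:
    """Jam the few-shot examples in as prior turns ahead of the real input."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for user, assistant in FEW_SHOT:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": user_text})
    return messages

msgs = build_messages("He sounded furious on the phone.")
# Pass `msgs` to a chat completion call against the *base* model and eyeball
# whether the output already follows your format before fine-tuning anything.
```

If the base model can’t hold the format with the examples in-context, a fine-tune built from those same examples is unlikely to fix it.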

The fact that it’s not capable of formatting JSON anymore means there’s something wrong with your training data. Can you share some samples of it?
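Before paying for another run, it’s also worth a mechanical pass over the file. A minimal sketch, assuming the standard chat-format JSONL and that every assistant message should parse as JSON with some top-level key you expect (`"phrases"` here is just an example):

```python
import json

def check_training_file(path: str, required_key: str = "phrases") -> list[str]:
    """Return a list of problems found in a chat-format fine-tuning JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {lineno}: not valid JSON")
                continue
            assistants = [m for m in example.get("messages", [])
                          if m.get("role") == "assistant"]
            if not assistants:
                problems.append(f"line {lineno}: no assistant message")
                continue
            for m in assistants:
                try:
                    reply = json.loads(m["content"])
                except (json.JSONDecodeError, TypeError):
                    problems.append(f"line {lineno}: assistant reply is not JSON")
                    continue
                if required_key not in reply:
                    problems.append(f"line {lineno}: missing key {required_key!r}")
    return problems
```

An empty return means every assistant turn at least matches the format you think you trained on.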


You’re right, I made a stupid mistake by accidentally using an old version of the system prompt that was less clear about the JSON formatting requirements. I forgot to change it in my API call from my earlier tests.
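In case it saves anyone else $10: the check that would have caught this is trivial. Pull the system messages out of the training file and compare against what your inference code sends (the path here is a hypothetical example):

```python
import json

TRAINING_FILE = "training_data.jsonl"  # hypothetical path to your fine-tune file

def training_system_prompts(path: str) -> set[str]:
    """Collect every distinct system message used in the training file."""
    prompts = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for m in json.loads(line).get("messages", []):
                if m.get("role") == "system":
                    prompts.add(m["content"])
    return prompts

# At inference time, fail loudly if the prompt drifted from what was trained on:
# assert my_system_prompt in training_system_prompts(TRAINING_FILE)
```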


Oof. Well. $10 education ain’t the worst in the world :rofl: