Fine-tuning results question

I fine-tuned GPT-3.5-turbo-1106, and the results on the test set were unexpectedly good, so I went on to do additional work with the validation set. However, the validation results were noticeably worse than the test results. I'm curious why this happened. I checked whether the test set scored so well because some of its data had leaked into the training data, but that was not the case.
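For reference, a minimal sketch of one way to run that leakage check, assuming the training examples are chat-format dicts (as in a fine-tuning JSONL file) and the test prompts are plain strings; the function names and data shapes here are illustrative, not from my actual pipeline:

```python
def extract_user_prompts(training_examples):
    """Collect normalized user-turn contents from chat-format training examples."""
    prompts = set()
    for example in training_examples:
        for message in example["messages"]:
            if message["role"] == "user":
                prompts.add(message["content"].strip().lower())
    return prompts

def find_overlap(training_examples, test_prompts):
    """Return test prompts that also appear verbatim (case-insensitively) in training."""
    seen = extract_user_prompts(training_examples)
    return [p for p in test_prompts if p.strip().lower() in seen]

train = [{"messages": [{"role": "user", "content": "What is 2+2?"},
                       {"role": "assistant", "content": "4"}]}]
test = ["What is 2+2?", "What is the capital of France?"]
print(find_overlap(train, test))  # → ['What is 2+2?']
```

Note that an exact-match check like this misses near-duplicates (paraphrases, whitespace or punctuation changes), which can still inflate test scores.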

I used JSONL format for fine-tuning and CSV format for testing.
Could that be the problem?
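The file format itself shouldn't matter as long as the prompts end up formatted identically at inference time. One way to rule this out is to convert the CSV test rows into the same chat-JSONL shape used for fine-tuning, so both sets go through one code path. A minimal sketch, assuming columns named "prompt" and "completion" (adjust to your actual headers):

```python
import csv
import io
import json

def csv_to_chat_jsonl(csv_text):
    """Convert CSV rows into chat-format JSONL lines matching the fine-tuning data."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"messages": [
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["completion"]},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

sample = "prompt,completion\nWhat is 2+2?,4\n"
print(csv_to_chat_jsonl(sample))
```

If the model sees even slightly different prompt wrapping at test time versus training time, scores can shift, so normalizing both sets like this is a cheap sanity check.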

There are many different factors that can contribute to this, and most of them are broad in scope and hard to pin down. It could have something to do with the fine-tuning data, the fine-tuning method, the amount of data, etc.

My best guess is that there's some overfitting going on. You can read this to get a good understanding: Exploring Overfitting Risks in Large Language Models | NCC Group Research Blog

Again, I can't confirm that this is what happened, but hopefully this suggestion points you in the right direction.