I am fine-tuning a model to translate natural-language queries into a specific programming language, using a large and diverse dataset. During training, the training metric reaches 99.5%, but on the validation set it plateaus at 85-86% and does not improve by even a percentage point. Worse, when I actually test the fine-tuned model on held-out examples, only 66% of its answers are correct.
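One caveat up front: I assume (but do not know for certain) that the reported training/validation metric is token-level accuracy. If so, a large gap between token accuracy and whole-query correctness is expected, since a single wrong token usually breaks the generated query. A toy calculation under a deliberately crude independence assumption (the numbers are illustrative only):

```python
# Toy illustration (crude independence assumption, purely illustrative):
# high per-token accuracy need not translate into high whole-sequence
# exact match, because one wrong token breaks the generated query.
token_accuracy = 0.86   # roughly our validation metric
avg_output_len = 20     # assumed average output length in tokens

naive_exact_match = token_accuracy ** avg_output_len
print(f"Naive exact-match estimate: {naive_exact_match:.1%}")  # ~4.9%
```

Since token errors are correlated in practice, the real exact-match rate sits well above this naive bound, so 86% token accuracy coexisting with 66% correct answers may partly be a metric artifact rather than a training problem.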
We have repeatedly cleaned, augmented, diversified, and randomized our dataset, yet the result remains unchanged.
Increasing the number of training epochs from 1 to 2 also has no effect, except that the model now overfits.
I suspect that the loss function used when fine-tuning gpt-3.5-turbo does not compare the generated token sequence with the reference sequence position by position, but perhaps compares a bag of tokens instead, which would lead to subpar results.
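For reference, standard causal-LM fine-tuning optimizes token-level cross-entropy under teacher forcing, i.e. a position-by-position comparison, not a bag of tokens; I do not know whether OpenAI deviates from this, hence my question below. A minimal PyTorch sketch of that standard loss, just to make the distinction concrete:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the standard causal-LM fine-tuning loss (an assumption
# about what OpenAI likely uses; they have not published the internals):
# cross-entropy of each predicted token against the reference token at the
# SAME position, i.e. a sequence-order comparison, not a bag of tokens.
vocab_size, seq_len = 50_000, 16
logits = torch.randn(1, seq_len, vocab_size)          # dummy model outputs
targets = torch.randint(0, vocab_size, (1, seq_len))  # dummy reference tokens

# Teacher forcing: position t predicts the reference token at position t+1.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    targets[:, 1:].reshape(-1),
)
print(loss.item())
```

If the metric were instead bag-of-tokens based, it would reward outputs containing the right tokens in the wrong order, which is exactly the kind of thing that could inflate validation scores relative to real correctness.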
I would appreciate it if you could share whether you have had a similar experience:
What values were you able to reach on the training and validation sets while fine-tuning gpt-3.5-turbo?
What loss function does OpenAI use for fine-tuning?
P.S. Please do not suggest RAG, ICL, stacking, etc.; we already use them. What I need right now is to understand what is happening specifically during the fine-tuning stage.
Thank you very much!