Training accuracy function in fine-tuning of gpt-3.5-turbo

I am fine-tuning a model to translate natural-language queries into a specific programming language, using a large and diverse dataset. During training, the reported training accuracy reaches 99.5%, but on the validation set it plateaus at 85-86% and does not improve by even a percentage point. Worse, when I actually test the fine-tuned model on examples, it produces only 66% correct answers.
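One reason a high per-token accuracy can coexist with a much lower rate of correct answers is that token-level accuracy and exact-match correctness measure different things. The sketch below is illustrative only (it is an assumption, not OpenAI's actual metric): a completion can match almost every reference token and still be a wrong program.

```python
# Illustrative sketch: token-level accuracy vs. exact-match accuracy.
# This is NOT OpenAI's actual metric -- just a demonstration of why
# ~85% token accuracy can coexist with far fewer fully correct outputs.

def token_accuracy(pred_tokens, ref_tokens):
    """Fraction of aligned positions where predicted token equals reference."""
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / max(len(pred_tokens), len(ref_tokens))

def exact_match(pred_tokens, ref_tokens):
    """A generated program is only 'correct' if the whole sequence matches."""
    return pred_tokens == ref_tokens

# Hypothetical example: one wrong literal ruins the query.
ref  = ["SELECT", "name", "FROM", "users", "WHERE", "age", ">", "30"]
pred = ["SELECT", "name", "FROM", "users", "WHERE", "age", ">", "18"]

print(token_accuracy(pred, ref))  # 0.875 -- high token overlap
print(exact_match(pred, ref))     # False -- still a wrong query
```

So a validation metric near 86% at the token level does not contradict a 66% rate of fully correct generations; errors concentrate into whole-sequence failures.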

We have repeatedly cleaned, augmented, diversified, and randomized our dataset, yet the result remains unchanged.

Increasing the number of training epochs (from 1 to 2) also has no effect, except that the model now overfits.

I suspect that the evaluation function used in gpt-3.5-turbo fine-tuning does not simply compare the generated token sequence with the reference sequence, but perhaps compares a bag of tokens instead, which would explain the inflated metrics.
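To make the suspicion concrete, here is a minimal sketch of the difference between a bag-of-tokens comparison and a position-wise sequence comparison. Both functions are hypothetical (we don't know what OpenAI actually computes); the point is that an order-insensitive metric can score a scrambled, non-functional completion as perfect.

```python
from collections import Counter

# Hypothetical metrics -- an assumption about what *could* inflate scores,
# not a claim about OpenAI's implementation.

def bag_of_tokens_score(pred, ref):
    """Multiset overlap: ignores token order entirely."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return overlap / max(len(pred), len(ref))

def sequence_score(pred, ref):
    """Position-wise match: order matters."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

# Same tokens, scrambled order -- syntactically invalid as code.
ref  = ["for", "x", "in", "xs", ":", "print", "(", "x", ")"]
pred = ["print", "(", "x", ")", "for", "x", "in", "xs", ":"]

print(bag_of_tokens_score(pred, ref))  # 1.0 -- looks perfect
print(sequence_score(pred, ref))       # 0.0 -- no position matches
```

If the fine-tuning dashboard were reporting anything close to the first metric, it would systematically overstate how usable the generated code is.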

I would appreciate it if you could share whether you have had a similar experience:

What values were you able to reach on the training and validation datasets while fine-tuning gpt-3.5-turbo?

What loss function does OpenAI use for training?

PS. Please do not suggest utilizing RAG, ICL, stacking, etc., as we are already employing them. It’s crucial for me to understand what’s happening specifically during the fine-tuning stage right now.

Thank you very much!

Have you tried changing the number of rows you are training on, or changing the input prompt? Seems tricky.
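If you do experiment with the input prompt, it has to change in the training data itself. A minimal sketch of the chat-format JSONL that gpt-3.5-turbo fine-tuning expects (the system/user/assistant content here is hypothetical, matching the NL-to-code use case above):

```python
import json

# One hypothetical training row in the chat fine-tuning format.
# Each line of the JSONL file is a JSON object with a "messages" list.
example = {
    "messages": [
        {"role": "system", "content": "Translate the user's request into SQL."},
        {"role": "user", "content": "List all users older than 30."},
        {"role": "assistant", "content": "SELECT * FROM users WHERE age > 30;"},
    ]
}

# Write one example per line -- JSONL, not a JSON array.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

Rewording the system prompt consistently across all rows (and using the same prompt at inference time) is a cheap experiment before touching the dataset size.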

Here’s also a guide on how to fine-tune and evaluate using Braintrust:

Braintrust (building the enterprise AI stack)


What’s going on is that the gpt-3.5-turbo model doesn’t have good architectural inference, and it is worse than what it once was and what GPT-3 once was. Instead, OpenAI fine-tunes on millions and millions of user training pairs and conversations to give the model fluency. It doesn’t need language-inference skill to extract meaning from the questions when the AI has just been pre-tuned on everything a human could imagine saying to an AI. Unfortunately for you, that means high perplexity.