Training accuracy function in fine-tuning of ChatGPT 3.5-turbo

agerhsun · October 28, 2023, 6:25am

I am fine-tuning a model to translate natural-language queries into a specific programming language, using a large and diverse dataset. During training, the training metric reaches 99.5%, but on the validation dataset it caps at 85-86% and does not increase by even a percentage point. In practice, when I test the fine-tuned model on examples, it yields only 66% correct answers.

We have repeatedly cleaned, augmented, diversified, and randomized our dataset, yet the result remains unchanged.

Increasing the number of training epochs (from 1 to 2) also has no effect, except that overfitting now occurs.

I suspect that the error evaluation function used in fine-tuning ChatGPT 3.5-turbo does not merely compare the sequence of tokens with the original sequence, but perhaps compares a bag of tokens instead, leading to subpar results.
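To make the suspicion concrete, here is a minimal sketch of how a position-wise token metric, an order-insensitive "bag of tokens" metric, and a whole-answer exact match can disagree on the same output. Everything in it is an assumption for illustration: the function names are hypothetical, the tokens are whitespace-split words rather than the model's real tokens, and none of it is claimed to be the metric OpenAI actually computes.

```python
from collections import Counter

def token_accuracy(pred, ref):
    """Fraction of positions where the predicted token equals the reference token."""
    if not pred and not ref:
        return 1.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

def bag_overlap(pred, ref):
    """Order-insensitive overlap: shared token counts divided by the longer length."""
    shared = sum((Counter(pred) & Counter(ref)).values())
    return shared / max(len(pred), len(ref), 1)

def exact_match(pred, ref):
    """True only if the whole sequence is identical."""
    return pred == ref

# Hypothetical toy example (whitespace tokens, not a real tokenizer)
ref = "SELECT name FROM users WHERE age > 30".split()
one_wrong = "SELECT name FROM users WHERE age < 30".split()
swapped = "SELECT users FROM name WHERE age > 30".split()

for label, pred in [("one_wrong", one_wrong), ("swapped", swapped)]:
    print(label, token_accuracy(pred, ref), bag_overlap(pred, ref), exact_match(pred, ref))
```

In this toy case a single wrong operator still scores 0.875 on the token metric while the answer as a whole is wrong, and the swapped query scores 1.0 on the bag metric but only 0.75 position-wise. If the reported metric is averaged per token, a gap between it and the share of fully correct answers would be expected even without a bag-of-tokens comparison.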

I would appreciate it if you could share whether you have had a similar experience:

What values were you able to reach on the training and validation datasets while fine-tuning ChatGPT 3.5?

What error function does OpenAI use for training?

PS. Please do not suggest utilizing RAG, ICL, stacking, etc., as we are already employing them. It’s crucial for me to understand what’s happening specifically during the fine-tuning stage right now.

Thank you very much!

Braintrust · November 1, 2023, 11:20pm

Have you tried changing the number of rows you are training on, or changing the input prompt? Seems tricky.

Here’s also a guide on how to fine-tune and evaluate using Braintrust:

Braintrust(building the enterprise ai stack): https://braintrustdata.com/

_j · November 2, 2023, 1:51am

What’s going on is that the gpt-3.5-turbo model doesn’t have good architectural inference, and it is worse than what it once was and what GPT-3 once was. Instead, OpenAI fine-tunes on millions and millions of user training pairs and conversations to give the model fluency. It doesn’t need language inference skill to extract meaning between the questions when the AI has just been pre-tuned on everything a human could imagine saying to an AI. Unfortunately for you, that means high perplexity.
