The babbage-002 fine-tuned model generates invalid categories

I fine-tuned a babbage-002 model on my data, which has 30 categories. While testing the model, it predicted invalid categories (categories not among my original 30) for some descriptions.
Do you know why this happens or how it can be fixed?

Some details:
The model was trained for 4 epochs.
Training data length was 84K.
30 categories.
The stop sequence is " end".
The formatting is => {"prompt": "$part description ->", "completion": " category end"}
An example error: "Belt - Drive" (original category) | "Belt - Drive not important" (incorrect completion)

There are three ways to fix this:

  1. More and better training data.
  2. Training on a stronger model.
  3. Both of the above.

Remember, the models are text prediction engines, they’re not actually classifiers (they just “act” like a classifier if asked to).

The steps I would take to address this:

Change your classification system so that each class is a single token. Numbers are a good bet here, but any thirty distinct tokens will work. Since you’re using the API, you can just do a lookup and substitution to recover the actual class names you want.
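A minimal sketch of that lookup-and-substitution idea ("Belt - Drive" is from the question; the other category names here are hypothetical placeholders):

```python
# Map each real category name to a single-token class code ("0" through "29").
categories = ["Belt - Drive", "Filter - Oil", "Spark Plug"]  # ...up to 30 entries

code_for = {name: str(i) for i, name in enumerate(categories)}
name_for = {str(i): name for i, name in enumerate(categories)}

def encode_label(name: str) -> str:
    """Category name -> single-token code used in training completions."""
    return code_for[name]

def decode_label(code: str) -> str:
    """Model output code -> original category name."""
    return name_for[code.strip()]
```

So `encode_label("Belt - Drive")` gives `"0"` for the training file, and `decode_label` turns the model's output back into a human-readable category.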

Then each line of your training data just looks like,

{"prompt": "$part description ->", "completion": " 7"}
This will have the benefit of somewhat reducing the token cost of your training and make it easier for the model to reliably act as a classifier.

It may be beneficial to add another token like . or ; after the number and use that as a stop token. That way the model learns to terminate each class and you don’t get runaway generation.
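As a sketch, writing the re-coded training file with ";" as the terminator might look like this (the field names follow the prompt/completion format above; the example rows are made up):

```python
import json

# Hypothetical labelled rows: (part description, single-token class code)
rows = [
    ("serpentine belt for 2.0L engine", "0"),
    ("oil filter cartridge", "1"),
]

with open("train.jsonl", "w") as f:
    for description, code in rows:
        record = {
            "prompt": f"{description} ->",
            # leading space before the code, ";" appended as the stop token
            "completion": f" {code};",
        }
        f.write(json.dumps(record) + "\n")
```

At inference time you would then pass `stop=[";"]` so generation halts right after the class code.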

You said the training data length was 84k, is that tokens or examples?

Having more total examples is always better. If you can find or synthesize more, you should.

Four epochs is an okay number for fine-tuning, but fewer epochs with more data would probably be better.


This is a case where I kind of disagree. The way one would normally fine-tune, we want the AI to infer: inference being the ability to come up with new answers that were never in the training set, based only on learned behaviors, filling in the gaps of the fine-tune from its corpus knowledge.

Here, you really want a “canned answer machine” that can only give you back the responses you’ve put into the training file. That would be called overfitting in any other AI language application.

There are other hyperparameters besides “epochs” now exposed. They are not visible in the slick GUI, either to be specified or recalled; you have to use API calls. With learning_rate_multiplier, for example, you can reduce your token cost by increasing it instead of increasing epochs. A value of 2.0 has been seen used by the auto settings on smaller training sets, and you can pull down what was used on your own “auto” job. There is also the option of continuing on an existing fine-tune, if you want to see the improvement or degradation from running more passes of learning.
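A sketch of what such an API call's request body might look like (n_epochs and learning_rate_multiplier are hyperparameters of the fine-tuning jobs endpoint; the file ID and values here are placeholders):

```python
# Request body for creating a fine-tuning job, built as a plain dict
# so it can be inspected; send it with your HTTP client or SDK of choice.
job_request = {
    "model": "babbage-002",
    "training_file": "file-abc123",       # placeholder training file ID
    "hyperparameters": {
        "n_epochs": 3,
        "learning_rate_multiplier": 2.0,  # raise this instead of adding epochs
    },
}
```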

Here’s another very similar thread from today where you can explore and come up with some ideas, because fine-tuning is very much charting your own course, with sparse guidance from OpenAI and your own experimentation needed:


You make a good and interesting point.

My counter-argument would be that overfitting on the training data potentially diminishes the model’s ability to do inference on out-of-bag examples.

By increasing the number of novel examples with the same 30 classifications and doing it for the same number of total training examples, you should get better results.

For instance, I would expect (generally) that,

40,000 examples × 3 epochs would outperform 30,000 examples × 4 epochs for a 30-class classifier. (Both amount to 120,000 total example-passes, but the first sees more unique data.)

There are undoubtedly instances where this isn’t true, but all things being equal I would probably pick the first over the second (largely dependent, I suppose, on the variability in the text to be classified and how semantically similar the descriptions of some categories are).

I’d also be interested in seeing the training-loss curve in this case.

All that said, I still think the best first step would be to recode the classes to individual tokens; and now that I think about it for a second, you could just limit the output to a single token and do away with all the stop-token stuff.
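A sketch of what that final single-token setup might look like at inference time (the request dict mirrors the legacy completions endpoint parameters; the model name is a placeholder, and the decoding assumes the number-to-name lookup described earlier):

```python
name_for = {"0": "Belt - Drive", "1": "Filter - Oil"}  # hypothetical lookup

def build_request(description: str) -> dict:
    """Completion request constrained to a single output token."""
    return {
        "model": "ft:babbage-002:...",  # placeholder fine-tuned model name
        "prompt": f"{description} ->",
        "max_tokens": 1,                # one class token, no stop sequence needed
        "temperature": 0,               # deterministic classification
    }

def classify(raw_completion: str) -> str:
    """Map the model's single-token output back to a category name."""
    return name_for.get(raw_completion.strip(), "UNKNOWN")
```

With max_tokens = 1 the model physically cannot produce a runaway completion, and anything outside the lookup table is flagged rather than silently accepted.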
