The babbage-002 fine-tuned model generates invalid categories

I fine-tuned a babbage-002 model on my data, which has 30 categories. While testing the model, it predicted invalid categories (categories not among my original 30) for some descriptions.
Do you know why this happens or how it can be fixed?

Some details:
The model was trained for 4 epochs.
Training data length was 84K.
30 categories.
The stop sequence is " end".
The formatting is => {"prompt": "$part description ->", "completion": " category end"}
An example error: "Belt - Drive" (original category) | "Belt - Drive not important" (incorrect completion)

There are three ways to fix this:

  1. More and better training data.
  2. Training on a stronger model.
  3. Both of the above.

Remember, the models are text prediction engines, they’re not actually classifiers (they just “act” like a classifier if asked to).

The steps I would take to address this:

Change your classification system so that each class is a single token. Numbers are a good bet here, but any thirty distinct tokens will work. Since you’re using the API, you can just do a lookup and substitution to recover the actual class names you want.
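A minimal sketch of that lookup-and-substitution idea ("Belt - Drive" is from the question; the other category names here are hypothetical placeholders):

```python
# Map each real category name to a single-token class code ("0" through "29").
categories = ["Belt - Drive", "Filter - Oil", "Spark Plug"]  # ...up to 30 entries

code_for = {name: str(i) for i, name in enumerate(categories)}
name_for = {str(i): name for i, name in enumerate(categories)}

def encode_label(name: str) -> str:
    """Category name -> single-token code used in training completions."""
    return code_for[name]

def decode_label(code: str) -> str:
    """Model output code -> original category name."""
    return name_for[code.strip()]
```

So `encode_label("Belt - Drive")` gives `"0"` for the training file, and `decode_label` turns the model's output back into a human-readable category.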

Then each line of your training data just looks like,

{"prompt": "$part description ->", "completion": " 7"}
This will have the benefit of somewhat reducing the token cost of your training and make it easier for the model to reliably act as a classifier.

It may be beneficial to add another token like . or ; after the number and use that as a stop token. That way the model learns to terminate each class and you don’t get runaway generation.
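As a sketch, writing the re-coded training file with ";" as the terminator might look like this (the field names follow the prompt/completion format above; the example rows are made up):

```python
import json

# Hypothetical labelled rows: (part description, single-token class code)
rows = [
    ("serpentine belt for 2.0L engine", "0"),
    ("oil filter cartridge", "1"),
]

with open("train.jsonl", "w") as f:
    for description, code in rows:
        record = {
            "prompt": f"{description} ->",
            # leading space before the code, ";" appended as the stop token
            "completion": f" {code};",
        }
        f.write(json.dumps(record) + "\n")
```

At inference time you would then pass `stop=[";"]` so generation halts right after the class code.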

You said the training data length was 84k, is that tokens or examples?

Having more total examples is always better. If you can find or synthesize more, you should.

Four epochs is an okay number for fine-tuning, but fewer epochs with more data would probably be better.


This is a case where I kind of disagree. The way one would normally fine-tune, we want the AI to infer: inference being the ability to come up with new answers that were never in the training set, based only on learned behaviors, filling in the gaps of the fine-tune from its corpus knowledge.

Here, you really want a “canned answer machine” that can only give you back the responses you’ve put into the training file. That would be called overfitting in any other AI language application.

There are other hyperparameters besides “epochs” now exposed. They are not visible in the slick GUI, either to be specified or recalled; you have to use API calls. With learning_rate_multiplier, for example, you can reduce your token cost by increasing it instead of increasing epochs. A value of 2.0 has been seen used by the auto settings on smaller training sets, and you can pull down what was used on your own “auto” job. There is also the option of continuing on an existing fine-tune, if you want to see the improvement or degradation from running more passes of learning.
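A sketch of what such an API call's request body might look like (n_epochs and learning_rate_multiplier are hyperparameters of the fine-tuning jobs endpoint; the file ID and values here are placeholders):

```python
# Request body for creating a fine-tuning job, built as a plain dict
# so it can be inspected; send it with your HTTP client or SDK of choice.
job_request = {
    "model": "babbage-002",
    "training_file": "file-abc123",       # placeholder training file ID
    "hyperparameters": {
        "n_epochs": 3,
        "learning_rate_multiplier": 2.0,  # raise this instead of adding epochs
    },
}
```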

Here’s another very similar thread from today where you can explore and come up with some ideas, because fine-tuning is very much charting your own course, with sparse guidance from OpenAI and your own experimentation needed:


You make a good and interesting point.

My counter-argument would be that overfitting on the training data potentially diminishes the model’s ability to do inference on out-of-bag examples.

By increasing the number of novel examples with the same 30 classifications and doing it for the same number of total training examples, you should get better results.

For instance, I would expect (generally) that,

40,000 examples × 3 epochs would outperform 30,000 examples × 4 epochs for a 30-class classifier. (Both amount to 120,000 total example-passes, but the first sees more unique data.)

There are undoubtedly instances where this isn’t true, but all things being equal I would probably pick the first over the second (largely dependent, I suppose, on the variability in the text to be classified and how semantically similar the descriptions of some categories are).

I’d also be interested in seeing the training-loss curve in this case.

All that said, I still think the best first step would be to recode the classes to individual tokens; and now that I think about it for a second, you could just limit the output to a single token and do away with all the stop-token stuff.
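A sketch of what that final single-token setup might look like at inference time (the request dict mirrors the legacy completions endpoint parameters; the model name is a placeholder, and the decoding assumes the number-to-name lookup described earlier):

```python
name_for = {"0": "Belt - Drive", "1": "Filter - Oil"}  # hypothetical lookup

def build_request(description: str) -> dict:
    """Completion request constrained to a single output token."""
    return {
        "model": "ft:babbage-002:...",  # placeholder fine-tuned model name
        "prompt": f"{description} ->",
        "max_tokens": 1,                # one class token, no stop sequence needed
        "temperature": 0,               # deterministic classification
    }

def classify(raw_completion: str) -> str:
    """Map the model's single-token output back to a category name."""
    return name_for.get(raw_completion.strip(), "UNKNOWN")
```

With max_tokens = 1 the model physically cannot produce a runaway completion, and anything outside the lookup table is flagged rather than silently accepted.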
