Ada model fine-tuned for classification gets delusional when fed thousands of records

Hi,
I have a fine-tuned Ada model for classification with around 50 classes.

My prompt end pattern is “\n\n###\n\n” and the response end pattern is “$$$”
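For readers following along, here is a minimal sketch of how those two patterns might be applied on the client side. The helper names `build_prompt` and `extract_label` are hypothetical and not from the original post; the idea is only that every request ends with exactly the separator used in training, and that anything after the end pattern is discarded.

```python
PROMPT_END = "\n\n###\n\n"  # prompt end pattern quoted in the post
COMPLETION_END = "$$$"      # response end pattern quoted in the post


def build_prompt(job_title: str) -> str:
    """Append the separator the model saw during fine-tuning (hypothetical helper)."""
    return job_title.strip() + PROMPT_END


def extract_label(completion_text: str) -> str:
    """Keep only the text before the end pattern and trim whitespace (hypothetical helper)."""
    return completion_text.split(COMPLETION_END, 1)[0].strip()
```

If the end pattern is also passed as the stop sequence of the completion request, the model should stop right after the label, and `extract_label` then only has to trim whitespace.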

It works really well, but I noticed that it gets delusional after a larger number of prompts in a row, around a few hundred. Setting the temperature to 0 helped at first, but then I ran it again on 9k records and it started to give false results again.

By a false result I mean that the class returned for the same prompt is different (wrong) when it is run in a queue of several thousand records than the result I get when I send it once, separately, to the same model with the same parameters.
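One way to pin this down is to log both runs and diff them. The sketch below assumes the labels from the queued run and from the one-at-a-time run have already been collected into two dictionaries keyed by prompt; the function name `find_mismatches` and the sample data are made up for illustration.

```python
from typing import Dict, List, Tuple


def find_mismatches(
    queued: Dict[str, str], standalone: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (prompt, queued_label, standalone_label) for every disagreement."""
    mismatches = []
    for prompt, queued_label in queued.items():
        standalone_label = standalone.get(prompt)
        # Only compare prompts that were classified in both runs.
        if standalone_label is not None and standalone_label != queued_label:
            mismatches.append((prompt, queued_label, standalone_label))
    return mismatches


# Made-up sample data, only to show the shape of the inputs.
queued_run = {"Senior Accountant": "sales", "Data Engineer": "engineering"}
single_run = {"Senior Accountant": "finance", "Data Engineer": "engineering"}
print(find_mismatches(queued_run, single_run))
```

Logging the exact prompt string sent in each run alongside the label would also show whether the two requests were really identical.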

I feel like the model gets saturated in some way when it is fed that many examples. Could there be some leakage of tokens?

I would appreciate any clue or direction as to what the reason might be and what steps I can take to improve the results.


Welcome @maksymilianpiechota

Are you batching the prompts?

The docs mention not to exceed 2048 tokens for classification.

PS: embeddings are another cost-effective method for classification
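To illustrate the embeddings idea, here is a rough sketch of nearest-centroid classification with cosine similarity. It assumes each job title has already been turned into an embedding vector by some embedding model; the two-dimensional vectors and the class names below are toy stand-ins, not real embeddings, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroids(labelled: List[Tuple[Sequence[float], str]]) -> Dict[str, List[float]]:
    """Average the vectors of each class into one centroid per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in labelled:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(vec: Sequence[float], class_centroids: Dict[str, List[float]]) -> str:
    """Pick the label whose centroid is most similar to the vector."""
    return max(class_centroids, key=lambda label: cosine(vec, class_centroids[label]))


# Toy vectors standing in for real embeddings.
training = [
    ([1.0, 0.1], "engineering"),
    ([0.9, 0.2], "engineering"),
    ([0.1, 1.0], "sales"),
    ([0.2, 0.8], "sales"),
]
class_centroids = centroids(training)
print(classify([0.8, 0.3], class_centroids))
```

Because the class is picked by a plain similarity comparison, the same input vector always maps to the same label, which makes this approach easy to check for consistency across large runs.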


I am not batching; I am sending one prompt after another (waiting for each request to be answered before sending the next).

I am classifying job titles, which are rarely more than 3 words, so I suppose I do not exceed the 2048-token limit.

Thanks for the embeddings tip, I will look into it, but for now I need to resolve the issue with my fine-tuned model.

Is there anyone who can help?

Is anyone from OpenAI monitoring the issues raised on this forum?

Can you confirm how it’s performing when you use batching @maksymilianpiechota ?

My suggestion is:

  • Use Babbage with a dataset of 700-900 examples.
  • You need more context in your dataset.

If I had some examples of your dataset, I would probably have a clue what is wrong.

I trained the Ada model with around 10k examples.

I have now trained Babbage as you suggested with 800 examples; with 37 classes, that is about 22 examples per class.

And I get much worse results with Babbage than with the previously trained Ada (even when testing separately, for just one classification).

P.S. I will check with my customer whether I can share the dataset.

Here are some more tips:

  • Add more volume if 800 is not enough (more diverse data is better). Depending on the training data, adding more examples will likely improve performance.

  • The best data is similar to what you’ll use the model for. Try to format it in a clear way that reads naturally. For instance:

Prompt:

Happy day → sounds positive

Completion:

True

  • Set the prompt loss weight to a smaller value like 0.1; this makes the model put less weight on learning the prompt tokens

  • For n_epochs: with more data you may need fewer epochs

  • Temperature 0
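As a sketch of the formatting tip above, the snippet below writes prompt/completion pairs as JSONL using the separators mentioned earlier in the thread. The helper names and the example labels are made up, and the leading space in the completion is a convention assumed here rather than something stated in the thread.

```python
import json
from typing import Iterable, Tuple

PROMPT_END = "\n\n###\n\n"  # prompt end pattern from the original post
COMPLETION_END = "$$$"      # response end pattern from the original post


def to_jsonl_line(job_title: str, label: str) -> str:
    """Serialize one training example as a single JSON line (hypothetical helper)."""
    record = {
        "prompt": job_title.strip() + PROMPT_END,
        # Assumption: the completion starts with a space and ends with the end pattern.
        "completion": " " + label.strip() + COMPLETION_END,
    }
    return json.dumps(record, ensure_ascii=False)


def to_jsonl(examples: Iterable[Tuple[str, str]]) -> str:
    """Join all examples into JSONL text, one record per line."""
    return "\n".join(to_jsonl_line(title, label) for title, label in examples) + "\n"


# Made-up examples, only to show the output shape.
print(to_jsonl([("Senior Accountant", "finance"), ("Data Engineer", "engineering")]))
```

Generating the file from one function keeps the separators identical across every example, so the prompts sent at inference time can reuse the same constants.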