Hi,
I have a fine-tuned Ada model for classification across around 50 classes.
My prompt end pattern is "\n\n###\n\n" and the completion end pattern is "$$$".
It works really well, but I noticed that it becomes unreliable after a larger number of prompts in a row, like a few hundred. Setting the temperature to 0 helped at first, but then I ran it again over 9k records and it started giving false results again.
By a false result I mean that the class returned for a given prompt is different (wrong) when it is run in a queue of thousands of records than the result I get when I feed that same prompt once, separately, to the same model with the same parameters.
I feel like the model gets saturated in some way when it's fed that many examples? That there is some leakage of tokens?
I would appreciate any clue or direction as to what the reason could be and what steps I can take to improve the results.
Add more volume if 800 examples are not enough (more diverse data is better). If you add more data, the performance will likely increase.
The best data is similar to what you'll use the model for. Try to format it in a clear way that makes sense to read. For instance:
Prompt:
Happy day → sounds positive
Completion:
True
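A minimal sketch of how training records like the one above could be serialized into the JSONL prompt/completion format, reusing the separators from the question ("\n\n###\n\n" and "$$$"). The helper name `to_record` is just for illustration:

```python
import json

# Separators taken from the question: the prompt ends with "\n\n###\n\n"
# and the completion ends with "$$$". The leading space before the label
# follows the usual fine-tuning data convention for completion tokens.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "$$$"

def to_record(text: str, label: str) -> str:
    """Serialize one training example as a single JSONL line."""
    return json.dumps({
        "prompt": text + PROMPT_END,
        "completion": " " + label + COMPLETION_END,
    })

# One line per example in the training file:
line = to_record("Happy day -> sounds positive", "True")
```

Keeping the separators identical across every record (and appending them the same way at inference time) matters, because an inconsistent stop pattern is a common cause of drifting completions in long batches.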
Set the prompt loss weight to a smaller value, like 0.1; this reduces how much the prompt tokens contribute to the training loss, so the model focuses on learning the completions.
For n_epochs: with more data you may need fewer epochs.
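Putting the two suggestions together, a fine-tune run might look like the sketch below. This assumes the legacy `openai` CLI for prompt/completion fine-tunes (flag names may differ in your installed version, so check `openai api fine_tunes.create --help`), and `train.jsonl` is a hypothetical training file:

```shell
# Sketch only: lower prompt_loss_weight and fewer epochs, per the advice above.
# The classification flags request held-out classification metrics during training.
openai api fine_tunes.create \
  -t train.jsonl \
  -m ada \
  --prompt_loss_weight 0.1 \
  --n_epochs 2 \
  --compute_classification_metrics \
  --classification_n_classes 50
```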