I was running a training data set of approximately 8 million training tokens with 3.5-1106, and it ran very well: little to no outright errors and very solid performance within my specific project parameters. Once I saw 0125 I used the exact same training data to fine-tune that as well, and I'm a bit surprised at the difference in quality, with outright errors of random characters and the output completely tapering off into nonsense towards the end.
I was wondering if anyone else was having similar results or if there are simply new parameters that I should be considering?
You can certainly view the learning hyperparameters that were actually employed in both runs.
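If you're on the Python SDK, retrieving both fine-tuning jobs will show what was actually used. A minimal sketch, where the job IDs are placeholders for your two runs:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder job IDs for the 1106 and 0125 runs; swap in your own
for job_id in ["ftjob-1106-run", "ftjob-0125-run"]:
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(job.fine_tuned_model)
    # Shows n_epochs, batch_size, learning_rate_multiplier as resolved for the run
    print(job.hyperparameters)
```

If the resolved hyperparameters differ between the two runs, that alone could account for a lot of the quality gap.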
However, the reduction in price hints at a reduction in performance: an increase in perplexity, and attention masking that turns the input context into its own "RAG" gamble as far as whether instructions or data are actually considered. This was unexpected, as 0125 was supposed to be a fix for "laziness" and for functions, not simply a worse model that might as well have been trained on RLHF alone for how much of a chat parrot it is.
At the least, you can see how the new fine-tune performs at much lower sampling parameters, temperature and top_p. If the AI can no longer differentiate good words from garbage on its own, you have to differentiate the tokens yourself with sampling restrictions.
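Something like this rough sketch (the fine-tune name and messages are placeholders for your own) would show whether clamping the samplers reins the output back in:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder model name and messages; use your 0125 fine-tune and a prompt
# that previously degenerated into garbage near the end of the output
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:your-org::abc123",
    messages=[
        {"role": "system", "content": "Your usual system prompt here."},
        {"role": "user", "content": "A request that previously went off the rails."},
    ],
    temperature=0.2,  # well below the default of 1.0
    top_p=0.2,        # only the most probable tokens can be sampled
)
print(response.choices[0].message.content)
```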
The same effect is seen with the base models. Davinci-002 can barely be compared with GPT-3 Curie; the only thing that keeps the downsized, damaged models working at all is the post-training.
A fine-tune model can certainly use more system prompting to have its task reinforced, or to take it in a different direction. That's really the only "training" you have besides the fine-tune itself: the prompting context of instructions, examples, and even example conversation exchanges that demonstrate the task.
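As a rough illustration (the model name, task, and messages here are all made up), an in-context example exchange alongside the system instructions can reinforce the fine-tuned behavior:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical task and fine-tune name; substitute your own
messages = [
    {"role": "system", "content": "You rewrite input text in the house style: "
                                  "short sentences, active voice, no jargon."},
    # An example exchange that demonstrates the task, serving as in-context "training"
    {"role": "user", "content": "Rewrite: 'Synergistic solutions were leveraged by stakeholders.'"},
    {"role": "assistant", "content": "People worked together and solved the problem."},
    # The actual request
    {"role": "user", "content": "Rewrite: 'Our paradigm-shifting platform empowers end users.'"},
]

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:your-org::abc123",
    messages=messages,
)
print(response.choices[0].message.content)
```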
A fine-tune model can also be used as the base for another fine-tune job, continuing refinement without re-investing in the total tokens it was already trained on. You just specify the fine-tune model's name instead of the base model name when you fine-tune on another training file.
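A minimal sketch of that continuation, assuming the Python SDK; the training file name and fine-tune name are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Upload the new (smaller) training file of additional examples
new_file = client.files.create(
    file=open("additional_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Pass the existing fine-tune's name where the base model name would normally go
job = client.fine_tuning.jobs.create(
    training_file=new_file.id,
    model="ft:gpt-3.5-turbo-1106:your-org::abc123",
)
print(job.id, job.status)
```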
Ah, got it. Sorry for the late responses. The part that drives me out of my mind is that I've done this as well, basically training off of the last functional model I was able to train, which was turbo-1106 at the beginning of February, and I'm still getting these nonsensical results. That's why I was thinking there might be some other default variable adjustments on the training side that I'm missing (top_p, temperature, etc.). My use case is more about fitting style/tone into a specific structure, so I've been making small adjustments to the training data over and over to push the rather subjective performance along little by little. So for it to just turn to straight nonsense feels like months of work down the toilet.