I spent some time playing around with the OpenAI fine-tuning API and discovered that noisy data still has drastic effects, even on powerful LLMs like Davinci.
I wrote about how to use data-centric AI in this recently published KDNuggets article so that you can improve your models too. The results I found were quite eye-opening.
Let me know what you think!
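For readers who want a feel for what flagging noisy training examples can look like, here is a minimal sketch. It is not the method from the article: it assumes you already have one embedding vector per example, and it uses a simple nearest-neighbour label-agreement heuristic. The function name flag_suspect_labels and the parameters k and min_agreement are hypothetical, chosen only for illustration.

```python
import numpy as np

def flag_suspect_labels(embeddings, labels, k=5, min_agreement=0.5):
    """Return indices of examples whose label disagrees with most of
    their k nearest neighbours in embedding space (cosine similarity).

    Illustrative heuristic only; assumes no embedding is the zero vector.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    # Normalise rows so that the dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    # Exclude each example from its own neighbour list.
    np.fill_diagonal(sims, -np.inf)
    # Indices of the k most similar examples for every row.
    neighbours = np.argsort(-sims, axis=1)[:, :k]
    # Fraction of neighbours that share the example's label.
    agreement = (y[neighbours] == y[:, None]).mean(axis=1)
    return np.flatnonzero(agreement < min_agreement)
```

Flagged indices are candidates for manual review, relabelling, or removal before fine-tuning. The thresholds would need tuning on real data.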
I like it! Auto-detect and remove (or correct) outliers in your training data.
But why did you embed with davinci-001 and not the newer ada-002? Some reasoning here? Wondering if you would get better results, since ada-002 is supposed to be better and has far fewer dimensions than
Welcome to the community!
Thanks for sharing your results with us.
Cleaning datasets is going to be needed even more in the months/years ahead.