So, I have been using the Chat Completions endpoint (3.5-Turbo) for a while, and I have built a nice product around it. I have now reached a point where I need to start fine-tuning to further improve the performance of my system (yes, I did try prompt engineering and single-/multi-shot examples; they were not sufficient).
To get hands-on, I ran a fine-tuning job (3.5-Turbo) with the OpenAI API. I will admit I have no idea how this works under the hood. I just read and followed the guidelines in the official OpenAI documentation on how much data to collect and how to prepare it, and then made the API calls.
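For context, my calls were basically the two standard steps from the fine-tuning guide, roughly like the sketch below (the file name and data are placeholders, and I am assuming the current `openai` Python SDK here):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each line of the JSONL file is one training example in chat format, e.g.:
# {"messages": [{"role": "system", "content": "..."},
#               {"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),  # placeholder file name
    purpose="fine-tune",
)

# Start the fine-tuning job against the uploaded file
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```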
Now that I have fine-tuned it, I do see that, empirically, the model is significantly better. But when I looked into the fine-tuning job out of curiosity, I found this training loss graph (image attached).
Can someone explain to me in layman's terms what this means? Should the training loss decrease monotonically? What are the implications of this for my model? Would the performance improve further if I somehow “cleaned up” the data and had “better”/more data?
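(In case it helps anyone reproduce what I am looking at: the same per-step loss numbers behind the graph can also be pulled from the job's event stream, something like the sketch below; the job id is a placeholder.)

```python
from openai import OpenAI

client = OpenAI()

# "ftjob-abc123" is a placeholder; substitute your actual job id
events = client.fine_tuning.jobs.list_events(
    fine_tuning_job_id="ftjob-abc123",
    limit=100,
)
for event in events.data:
    # Metric events carry messages like "Step 42/314: training loss=0.87"
    print(event.created_at, event.message)
```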
Also, I want to learn more about fine-tuning LLMs, with more emphasis on practical guidelines and best practices, especially for building out my product. Can anyone suggest useful resources?
Thanks folks!