I have fine-tuned a “davinci” base model on hundreds of QA pairs to build a customized chatbot.
After the model is built, it can return a response by calling the Completions endpoint with the fine-tuned model name.
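Roughly like this, using the legacy Python SDK (the model name below is just a placeholder for my actual fine-tuned model):

```python
import openai

openai.api_key = "sk-..."  # set via an environment variable in practice

# "davinci:ft-personal-2023-01-01-00-00-00" is a placeholder for the real
# fine-tuned model name returned by the fine-tuning job.
response = openai.Completion.create(
    model="davinci:ft-personal-2023-01-01-00-00-00",
    prompt="Question: What are your opening hours?\nAnswer:",
    max_tokens=100,
    temperature=0,
    stop=["\n"],
)
print(response["choices"][0]["text"].strip())
```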
Now I want to evaluate the model’s performance. Should I compare the model’s responses against the true answers (from the QA pairs) by calculating text similarity? If so, how can I calculate the accuracy/F1 of the fine-tuned model?
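For example, would a simple token-overlap F1 (SQuAD-style) against the reference answers be a reasonable approach? A rough sketch of what I have in mind (whitespace tokenization and the example data are just placeholders):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and the reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Average F1 over a held-out set of QA pairs; predictions would come from
# the fine-tuned model, references are the true answers.
examples = [("we open at 9am monday to friday", "We are open 9am-5pm, Monday to Friday.")]
scores = [token_f1(pred, ref) for pred, ref in examples]
print(sum(scores) / len(scores))
```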
The OpenAI built-in fine-tuning tool already offers the possibility of incorporating a validation dataset, so you can check some metrics during fine-tuning and, for instance, guard against overfitting. You’ll also get those metrics at the end of the fine-tuning process. Aren’t these metrics useful for you? Maybe you have a different use case? Feel free to let us know, hope it helps!
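For instance, once the job is done you can load the results file and inspect the curves. A rough sketch with pandas (the column names are the ones the legacy fine-tuning docs describe, and the validation columns only appear if a validation file was supplied, so check what your results file actually contains):

```python
import pandas as pd
import matplotlib.pyplot as plt

# results.csv is the output of `openai api fine_tunes.results -i <fine_tune_id>`,
# redirected to a file.
df = pd.read_csv("results.csv")

plt.plot(df["step"], df["training_loss"], label="training loss")
if "validation_loss" in df.columns:
    val = df.dropna(subset=["validation_loss"])
    plt.plot(val["step"], val["validation_loss"], label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```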
But when I run “!openai api fine_tunes.results -i <model_fine_tuned_name>” to get the file ID, it only returns the training stats as text; I can’t find the file ID anywhere.
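Is something like this the right way to locate the result file ID instead? (Rough sketch with the legacy Python SDK; the fine-tune ID is a placeholder.)

```python
import openai

openai.api_key = "sk-..."

# "ft-abc123" is a placeholder for the actual fine-tune job ID.
fine_tune = openai.FineTune.retrieve(id="ft-abc123")

# Any results files produced by the job should be listed here.
for f in fine_tune["result_files"]:
    print(f["id"], f["filename"])
```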
In order to retrieve validation metrics, you need to provide a properly formatted validation file when creating the fine-tuning job: the same kind of file (.jsonl) as your training data, but making sure there is no overlapping data between the two sets.
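Roughly like this with the legacy Python SDK (file names and the example pair are placeholders; the key part is passing a validation file when creating the job):

```python
import json
import openai

openai.api_key = "sk-..."

# Validation data: same prompt/completion format as the training file,
# but QA pairs that do NOT appear in the training set.
validation_pairs = [
    {"prompt": "Question: What are your opening hours?\nAnswer:",
     "completion": " We are open 9am-5pm, Monday to Friday.\n"},
]
with open("validation.jsonl", "w") as f:
    for pair in validation_pairs:
        f.write(json.dumps(pair) + "\n")

# Upload both files, then create the fine-tune with a validation_file so the
# results file also contains validation metrics.
train_file = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = openai.File.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

openai.FineTune.create(
    training_file=train_file["id"],
    validation_file=valid_file["id"],
    model="davinci",
)
```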
You can find all the info in the official guide here.