How to evaluate a completion (QA) model?


I have fine-tuned a “davinci” base model on hundreds of QA pairs to build a customized chatbot.
After the model is built, it can return a response by calling the Completions endpoint.
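For concreteness, a query to a fine-tuned completions model looks roughly like this with the legacy `openai` Python SDK (v0.x). The model name, prompt separator, and stop token below are placeholders — they must match whatever you used when preparing your training data:

```python
# Sketch of querying a fine-tuned completions model (legacy openai SDK v0.x).
# "davinci:ft-your-org-2023-01-01" and the "###" separator are placeholders.
def build_request(prompt: str, model: str = "davinci:ft-your-org-2023-01-01") -> dict:
    """Assemble the parameters for an openai.Completion.create call."""
    return {
        "model": model,
        # Fine-tuned completion models expect the same prompt format
        # (including the separator) that was used during training.
        "prompt": prompt + "\n\n###\n\n",
        "max_tokens": 256,
        "temperature": 0,   # deterministic answers for QA evaluation
        "stop": [" END"],   # whatever end token your completions were trained with
    }

if __name__ == "__main__":
    import openai  # requires OPENAI_API_KEY in the environment

    params = build_request("What is the capital of France?")
    response = openai.Completion.create(**params)
    print(response["choices"][0]["text"].strip())
```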


Now I want to evaluate the model's performance. Should I compare the model's responses with the true answers (from the QA pairs) by computing text similarity? If so, how can I calculate the accuracy/F1 of the fine-tuned model?
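One common way to score free-text QA output against a reference answer is SQuAD-style exact match and token-level F1. This is a standalone sketch of that idea, not tied to any OpenAI tooling — average the per-example scores over your held-out QA pairs:

```python
# SQuAD-style exact match and token-level F1 for free-text QA answers.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For longer, paraphrased answers, embedding-based similarity is a reasonable alternative, but token F1 is a simple and widely reported baseline.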

Thank you.

OpenAI's built-in fine-tuning tool already offers the possibility of incorporating a validation dataset, so you can check some metrics during fine-tuning and, for instance, prevent your model from overfitting. You'll also get those metrics at the end of the fine-tuning process. Aren't these metrics useful for you? Maybe you have a different use case? Feel free to let us know, hope it helps!! :slight_smile:


I’d also like to add that there are wonderful interfaces for monitoring your progress and handling a lot of the heavy lifting.

1 Like

Thank you @AgusPG.
I am following this post to download results.csv:
How to See the contents of OpenAI Fine Tuned Model Results in Python using the OpenAI API - #3 by hariharasudhanm1 (especially @guimaraesabri's answer).

But when I run `!openai api fine_tunes.results -i <model_fine_tuned_name>`, it only returns the training stats as plain text, and I can't find the file ID.

In order to retrieve validation metrics, you need to provide a properly formatted validation file when creating the fine-tuning job. It's the same kind of file (.jsonl) as your training data, but make sure there is no overlap between the two datasets.
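For example, a validation file is just one JSON object per line in the same prompt/completion format as the training file. This is a minimal sketch; the `###` separator and ` END` stop token are assumptions you should replace with whatever your training file uses:

```python
# Write a validation .jsonl in the same prompt/completion format as the
# training data. The "###" separator and " END" token are placeholders.
import json

def write_jsonl(pairs, path):
    """pairs: iterable of (question, answer) tuples."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            record = {
                "prompt": question + "\n\n###\n\n",
                "completion": " " + answer + " END",
            }
            f.write(json.dumps(record) + "\n")

# write_jsonl([("What is 2+2?", "4")], "validation.jsonl")
```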

You can find all the info in the official guide here.
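As an aside, if the CLI output isn't showing you the file ID, you can also read it off the fine-tune object itself with the legacy Python SDK. A sketch, assuming the v0.x field names (`ft-abc123` is a placeholder job ID):

```python
# Fetch results.csv through the legacy openai SDK (v0.x) instead of the CLI.
import csv
import io

def parse_results(csv_text: str) -> list:
    """Turn the results.csv body into a list of per-step row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

if __name__ == "__main__":
    import openai  # requires OPENAI_API_KEY in the environment

    ft = openai.FineTune.retrieve(id="ft-abc123")      # placeholder job ID
    result_file_id = ft["result_files"][0]["id"]
    rows = parse_results(openai.File.download(result_file_id).decode())
    print(rows[-1])  # metrics for the final training step
```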

Hope it helps!

1 Like

Thank you @AgusPG. I am looking into it.

1 Like