I have fine-tuned a GPT-3 model to answer short WebMD-style queries in a chatbot style (for academic purposes only).
Now I want to evaluate my model's performance using at least a few quantitative metrics, such as accuracy and precision.
Is this even possible with language models?
I have searched a lot and did not really find a way to calculate the accuracy of a language model, because the generated sentence can be correct in many different ways, as long as the core statement is correct.
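To make the problem concrete, here is a minimal sketch of what I mean, assuming I have one reference answer per query. It compares strict exact-match accuracy with a token-overlap F1 score. The function names and the sample sentences are hypothetical and only for illustration; this is not an established evaluation method for my task.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(prediction, reference):
    """True only if both answers have identical tokens in the same order."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall (order is ignored)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Two empty answers count as a match, one empty answer as a miss.
        return float(pred == ref)
    # Multiset intersection: number of tokens shared by both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and model output with the same core statement.
reference = "Drink plenty of fluids and rest."
prediction = "You should rest and drink plenty of fluids."

print(exact_match(prediction, reference))
print(round(token_f1(prediction, reference), 3))
```

Exact match rejects the prediction even though it says the same thing, while the overlap score is high. But an overlap score only counts shared words, so an answer that negates the reference ("do not drink plenty of fluids") would still score well, which is exactly why I am unsure how to measure correctness.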
Any ideas or articles on this?