Confusion matrix, precision, and recall for LLM responses

I don't think this is applicable to LLM-based applications, but I wanted to check with other friends and experts in the community.

I have been asked to create a confusion matrix for the responses generated by an LLM and to calculate precision and recall. Is that really applicable to LLM applications? My thinking is that OpenAI should already have done this kind of evaluation when the model itself was created. How do I build a confusion matrix from LLM responses when I do not have a ground truth that exactly matches the response? It may match semantically, but there is no way to compare it like orange to orange.


Maybe it is possible when you use a RAG architecture and have a ground truth (the context). Based on that, you can try to calculate the confusion matrix.
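
Here is a minimal sketch of what that calculation could look like. It assumes each LLM response has already been reduced to a yes/no label, and that a reference yes/no label has been derived from the context. How you do that reduction (human review, a semantic-similarity check, something else) is the hard part and is not shown. The names `confusion_counts` and `precision_recall` and the sample labels are made up for illustration.

```python
from collections import Counter


def confusion_counts(predicted, actual):
    """Count TP, FP, FN, TN for two equal-length lists of booleans."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = Counter()
    for p, a in zip(predicted, actual):
        if p and a:
            counts["tp"] += 1   # response says yes, reference says yes
        elif p and not a:
            counts["fp"] += 1   # response says yes, reference says no
        elif not p and a:
            counts["fn"] += 1   # response says no, reference says yes
        else:
            counts["tn"] += 1   # response says no, reference says no
    return {key: counts[key] for key in ("tp", "fp", "fn", "tn")}


def precision_recall(counts):
    """Return (precision, recall); 0.0 when a denominator is zero."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical labels, one pair per question (assumption, not real data).
predicted = [True, True, False, True, False]   # label derived from the LLM response
actual = [True, False, False, True, True]      # label derived from the context

counts = confusion_counts(predicted, actual)
precision, recall = precision_recall(counts)
print(counts)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

This only works once the free-text responses are mapped to discrete labels. It does not solve the semantic matching problem from the question; it just shows that the arithmetic is simple once that mapping exists.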