Validating and measuring the quality of an AI-generated summary of JSON data

I am using GPT-4 to create a summary of JSON data. The JSON data contains information about a site's users and their activity. The summaries I generated using prompting and few-shot examples look good. However, there are occasional instances of data omission, mis-reporting, and some cases of hallucination. For this and other reasons, I am building a validation step that uses an LLM to extract info in JSON format from the summary, and then compares that info against the original JSON using another prompt. I tried to look for standard quality metrics, but they seem to be designed for text-to-text comparison rather than JSON-to-text comparison, and I couldn't find metrics that are "unsupervised" in nature.
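For reference, the comparison step I have in mind looks roughly like this: flatten both the source JSON and the JSON extracted from the summary into fact sets, then score omissions and hallucinations. This is a minimal stdlib sketch, not my actual prompts; `extracted_json` is assumed to be the output of the LLM extraction step.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into dotted-path -> value pairs."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            items.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten(v, f"{prefix}{i}."))
    else:
        items[prefix[:-1]] = obj
    return items


def fact_scores(source_json, extracted_json):
    """Recall = fraction of source facts that survive into the summary
    (1 - omission rate); precision = fraction of extracted facts actually
    present in the source (1 - hallucination rate)."""
    src = set(flatten(source_json).items())
    ext = set(flatten(extracted_json).items())
    hit = len(src & ext)
    return {
        "recall": hit / len(src) if src else 1.0,
        "precision": hit / len(ext) if ext else 1.0,
        "omitted": src - ext,        # facts the summary dropped or altered
        "hallucinated": ext - src,   # facts not backed by the source data
    }
```

The "omitted" and "hallucinated" sets then give concrete examples to inspect, rather than just a score.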

Looking for answers from people who have come across a similar need and how they approached solving it.

I tried a similar approach a few months back: given a couple of samples of JSON and extracted values, I was able to use GPT to extract the key:value pairs from natural language. However, it was very inconsistent; often, the variability in the language expressing a value would cause it to miss critical key:value pairs.

With such tasks, I found that human eval is the best possible metric. Nowadays, as part of my daily pipeline testing, I compare the outputs against gold-standard outputs that I keep, just to ensure the performance is holding up. Time-consuming, yes, but worth it.
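A minimal sketch of what that daily gold-standard comparison can look like. The file layout (one JSON file per test case, matched by name) and the exact-match criterion are assumptions; in practice the comparison could be looser.

```python
import json
from pathlib import Path


def regression_check(output_dir, gold_dir):
    """Compare each pipeline output against its gold-standard counterpart.

    Returns (pass_rate, mismatched_file_names). A case fails if the output
    file is missing or its parsed JSON differs from the gold file's.
    """
    mismatches = []
    gold_files = sorted(Path(gold_dir).glob("*.json"))
    for gold_path in gold_files:
        out_path = Path(output_dir) / gold_path.name
        if (not out_path.exists()
                or json.loads(out_path.read_text()) != json.loads(gold_path.read_text())):
            mismatches.append(gold_path.name)
    pass_rate = 1 - len(mismatches) / len(gold_files) if gold_files else 1.0
    return pass_rate, mismatches
```

Comparing parsed JSON rather than raw text keeps the check insensitive to key order and whitespace.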


IIUC your problem, I have encountered the exact same one, which actually led me to develop a whole platform for testing structured responses.

I’m not sure it fits your whole use case (tbh I’m not sure I understood all of it), but the platform can help you validate the responses’ JSON schema, as well as exact expected values (in case some of them are deterministic).
You can use it both for development and for periodic testing.
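To illustrate the kind of schema check I mean (this is not Promptotype’s actual API, just a stdlib sketch; in production you’d more likely reach for the `jsonschema` package):

```python
def check_schema(data, schema, path=""):
    """Minimal structural validation of an LLM response.

    `schema` maps each required key to an expected Python type, or to a
    nested dict for nested objects. Returns a list of problems; an empty
    list means the response passed.
    """
    problems = []
    for key, expected in schema.items():
        here = f"{path}{key}"
        if key not in data:
            problems.append(f"missing: {here}")
        elif isinstance(expected, dict):
            if isinstance(data[key], dict):
                problems += check_schema(data[key], expected, here + ".")
            else:
                problems.append(f"expected object at: {here}")
        elif not isinstance(data[key], expected):
            problems.append(f"wrong type at: {here}")
    return problems
```

Running this on every response catches the structural failures (missing keys, wrong types) before you even get to comparing values.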

You’re more than welcome to check it out if relevant: Promptotype. Lmk if you have any questions/ feedback.