Validating and measuring the quality of an AI-generated summary of JSON data

I am using GPT-4 to create a summary of JSON data. The JSON data contains information about a site's users and their activity. The summaries I generated using prompting and few-shot examples look good. However, there are occasional instances of data omission, mis-reporting, and some cases of hallucination. For this and other reasons, I am building a validation step that uses an LLM to extract info in JSON format from the summary, and then compares that info against the original JSON using another prompt. I tried to look for standard quality metrics, but they seem to be designed for text-to-text comparison rather than JSON-to-text comparison, and I couldn't find metrics that are "unsupervised" in nature.
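For reference, the comparison step I have in mind looks roughly like this: flatten both the source JSON and the JSON extracted from the summary into fact sets, then score omissions and hallucinations. This is a minimal stdlib sketch, not my actual prompts; `extracted_json` is assumed to be the output of the LLM extraction step.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into dotted-path -> value pairs."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            items.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten(v, f"{prefix}{i}."))
    else:
        items[prefix[:-1]] = obj
    return items


def fact_scores(source_json, extracted_json):
    """Recall = fraction of source facts that survive into the summary
    (1 - omission rate); precision = fraction of extracted facts actually
    present in the source (1 - hallucination rate)."""
    src = set(flatten(source_json).items())
    ext = set(flatten(extracted_json).items())
    hit = len(src & ext)
    return {
        "recall": hit / len(src) if src else 1.0,
        "precision": hit / len(ext) if ext else 1.0,
        "omitted": src - ext,        # facts the summary dropped or altered
        "hallucinated": ext - src,   # facts not backed by the source data
    }
```

The "omitted" and "hallucinated" sets then give concrete examples to inspect, rather than just a score.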

Looking for answers from people who have come across a similar need and how they approached solving it.

I tried a similar approach a few months back: given a couple of samples of JSON and extracted values, I was able to use GPT to extract the key:value pairs from natural language. However, it was very inconsistent; often, the variability in the language expressing a value would cause it to miss critical key:value pairs.

With such tasks, I found that human eval is the best possible metric. Nowadays, as part of my daily pipeline testing, I compare the outputs against gold-standard outputs that I keep, just to ensure the performance is holding up. Time-consuming, yes, but worth it.
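A minimal sketch of what that daily gold-standard comparison can look like. The file layout (one JSON file per test case, matched by name) and the exact-match criterion are assumptions; in practice the comparison could be looser.

```python
import json
from pathlib import Path


def regression_check(output_dir, gold_dir):
    """Compare each pipeline output against its gold-standard counterpart.

    Returns (pass_rate, mismatched_file_names). A case fails if the output
    file is missing or its parsed JSON differs from the gold file's.
    """
    mismatches = []
    gold_files = sorted(Path(gold_dir).glob("*.json"))
    for gold_path in gold_files:
        out_path = Path(output_dir) / gold_path.name
        if (not out_path.exists()
                or json.loads(out_path.read_text()) != json.loads(gold_path.read_text())):
            mismatches.append(gold_path.name)
    pass_rate = 1 - len(mismatches) / len(gold_files) if gold_files else 1.0
    return pass_rate, mismatches
```

Comparing parsed JSON rather than raw text keeps the check insensitive to key order and whitespace.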


IIUC your problem, I have encountered the exact same one, which actually led me to develop a whole platform for testing structured responses.

I’m not sure it fits your whole use case (tbh I’m not sure I understood all of it), but the platform can help you validate the responses’ JSON schema, as well as exact expected values (in case some of them are deterministic).
You can use it both for development and for periodic testing.
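To illustrate the kind of schema check I mean (this is not Promptotype’s actual API, just a stdlib sketch; in production you’d more likely reach for the `jsonschema` package):

```python
def check_schema(data, schema, path=""):
    """Minimal structural validation of an LLM response.

    `schema` maps each required key to an expected Python type, or to a
    nested dict for nested objects. Returns a list of problems; an empty
    list means the response passed.
    """
    problems = []
    for key, expected in schema.items():
        here = f"{path}{key}"
        if key not in data:
            problems.append(f"missing: {here}")
        elif isinstance(expected, dict):
            if isinstance(data[key], dict):
                problems += check_schema(data[key], expected, here + ".")
            else:
                problems.append(f"expected object at: {here}")
        elif not isinstance(data[key], expected):
            problems.append(f"wrong type at: {here}")
    return problems
```

Running this on every response catches the structural failures (missing keys, wrong types) before you even get to comparing values.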

You’re more than welcome to check it out if relevant: Promptotype. Lmk if you have any questions/ feedback.