How can I measure JSON output accuracy?

I have a use case where a human is responsible for generating a JSON object based off of a webpage. It’s basically a form of data entry.

Using the extracted text from the webpage, I have GPT-4o-mini attempt to generate that same JSON object in order to pre-fill the data entry form.

How can I measure similarity between the LLM-generated output and the Human-generated output to better score different prompts and models on the pre-filling side?

I had a look at deepdiff, but it’s a bit too strict about matching the schema exactly.

I tried converting the JSON strings into python objects, but that doesn’t account for certain imprecision between the automated and the manual data entry.

@jed.arden it’s actually a really good use case for GPT-4o! In my experience it does a great job of comparing two JSON objects (or even other schemas), even when formatting and key ordering are different.

Here is a toy example I just tried. Prompt:

Compare the following two JSON objects:

{"foo": "bar", "items": ["a", "b"]}

{
  "items": ["a", "b"],
  "foo": "bar"
}

And its output:

Comparison:

Structure and Content: Both JSON objects contain the same keys (“foo” and “items”) with the same values. The values associated with these keys are identical.

Formatting: The second JSON object is formatted with line breaks and indentation, making it more human-readable. However, this difference in formatting does not affect the content or structure of the JSON data.

Order of Keys: As mentioned before, in JSON, the order of keys does not matter. Therefore, even though the keys are ordered differently in both JSON objects, they are still considered equivalent.

Conclusion: The two JSON objects are equivalent. The difference in formatting and key order does not impact their equivalence.

You could take this to the next level by getting it to provide a structured output of the comparison, e.g. by specifying is_equivalent = True | False and maybe providing some “difference calculation” (e.g. see Jaccard Index).
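For example, something along these lines. This is a rough sketch assuming a recent openai Python SDK with structured-output parsing; the `ComparisonResult` fields are just illustrative, not a fixed recipe:

```python
from openai import OpenAI
from pydantic import BaseModel

class ComparisonResult(BaseModel):
    is_equivalent: bool
    similarity: float          # e.g. a Jaccard-style score between 0 and 1
    differences: list[str]     # human-readable list of substantive mismatches

client = OpenAI()

def compare_json(expected: str, generated: str) -> ComparisonResult:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Compare the two JSON objects. Ignore key order and "
                        "formatting; report only substantive differences."},
            {"role": "user",
             "content": f"Expected:\n{expected}\n\nGenerated:\n{generated}"},
        ],
        response_format=ComparisonResult,
    )
    return completion.choices[0].message.parsed
```

The structured result then plugs straight into whatever aggregate scoring you want to do across prompts and models.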

How can this solution be applied for partial matches?

The goal isn’t to measure whether or not the two are identical.

The goal is to measure how off the automated input is from the manual equivalent.

To extend this example further:

Expected standard
{"foo": "bar", "items": [{"a": 1.0}, {"b": "text"}]}

What was generated
{
  "items": [{"a": 1}, {"b": "TEXT"}],
  "foo": "bar"
}

A literal comparison: 1/3 elements match, so a 33.3% match, hypothetically.
expected['foo'] == generated['foo']
expected['items'][0]['a'] != generated['items'][0]['a']  (1.0 vs 1)
expected['items'][1]['b'] != generated['items'][1]['b']  ("text" vs "TEXT")

A relaxed comparison: 3/3 elements match, so a 100% match.
expected['foo'] == generated['foo']
float(expected['items'][0]['a']) == float(generated['items'][0]['a'])
expected['items'][1]['b'].lower() == generated['items'][1]['b'].lower()

The actual schema I’m trying to compare is about 900 tokens. The most complex nesting is dict → list → dict → dict.
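In Python terms, the relaxed comparison I’m after looks something like this (a sketch only; it assumes list entries line up positionally, and the normalization rules are illustrative):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {path: leaf_value} pairs."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            items.update(flatten(v, f"{prefix}.{k}" if prefix else str(k)))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten(v, f"{prefix}[{i}]"))
    else:
        items[prefix] = obj
    return items

def normalize(value):
    """Relax trivial differences: numeric type, string case, whitespace."""
    if isinstance(value, bool):          # bool is a subclass of int; keep it distinct
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        return value.strip().casefold()
    return value

def relaxed_match_score(expected, generated):
    """Fraction of expected leaf fields the generated object got right."""
    exp, gen = flatten(expected), flatten(generated)
    if not exp:
        return 1.0
    hits = sum(1 for path, v in exp.items()
               if path in gen and normalize(gen[path]) == normalize(v))
    return hits / len(exp)

expected = {"foo": "bar", "items": [{"a": 1.0}, {"b": "text"}]}
generated = {"items": [{"a": 1}, {"b": "TEXT"}], "foo": "bar"}
print(relaxed_match_score(expected, generated))  # 1.0 under the relaxed rules
```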

I see. The example I gave was just a simple toy example, but you can then extend it with exactly what you wrote in your response - chain-of-thought prompting.

So in your system prompt you can specify a CoT and few-shot examples, more or less exactly as you have above, and see if that gets you further along.

If you’re using gpt-4o-mini, the human will pretty much always do a better job and be more consistent, but gpt-4o-mini will do it significantly cheaper. The real question is whether you’re satisfied with the job gpt-4o-mini is doing, and whether it’s good enough given the money it’s saving you or the increase in pages processed you’re getting.

On the consistency side, you could probably process the exact same page 5 times and get 3 different results, so you should evaluate that as well.
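Something like this would give you a rough consistency number (a sketch; `extract_json` is a hypothetical wrapper around your extraction call):

```python
import json
from collections import Counter

def consistency(page_text, extract_json, runs=5):
    """Run the same extraction several times and measure agreement."""
    outputs = [extract_json(page_text) for _ in range(runs)]
    # Canonicalize so key order and whitespace don't count as differences.
    canonical = [json.dumps(o, sort_keys=True) for o in outputs]
    counts = Counter(canonical)
    modal_output, freq = counts.most_common(1)[0]
    return {
        "distinct_outputs": len(counts),
        "agreement": freq / runs,          # fraction agreeing with the modal output
        "modal_output": json.loads(modal_output),
    }
```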


I agree. The goal is not to replace the human. The goal is to augment or speed up the human.

I still need to tune the number of runs to hopefully normalize the results.

But before doing that, I’m trying to figure out how to quantify the difference between an automatically generated JSON object, and the human created JSON object.

You simply need to define a score function.

If we assume that every web page processed contains the same set of elements so the JSON objects all have an identical schema, a very simple score function might be to award one point for every incorrect field. Then the comparison is simply to look at machine and human averages and, like golf, the lower score wins.

If there are different numbers or types of fields in different pages you’d just modify the score to be some type of weighted average.

Or maybe some fields are more important than others, so you’d weight those more heavily, etc.
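In code the golf-style score is only a few lines. A sketch, with purely illustrative field names and weights:

```python
# One point (or one weight unit) per incorrect field.
# Lower is better; 0.0 means every field matched.
FIELD_WEIGHTS = {            # hypothetical weights for illustration
    "title": 2.0,            # fields you care most about
    "price": 2.0,
    "published_date": 1.0,
}

def error_score(expected: dict, generated: dict, weights=FIELD_WEIGHTS) -> float:
    score = 0.0
    for field, expected_value in expected.items():
        weight = weights.get(field, 1.0)   # unlisted fields count as weight 1
        if generated.get(field) != expected_value:
            score += weight
    return score
```

Average that over a batch of pages for the machine and for the human, and you have a direct comparison.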

In terms of accuracy though, you might have better luck asking the model to locate data points individually, or first asking a model to generate a bullet-point list of all the data in the page and then pulling the data of interest out in a second pass.

You can also have the model critically evaluate its own work and revise its answers.

Given the exceptionally low cost of 4o-mini, doing multiple passes through the text is feasible.

The other thing I’d question is how, exactly, you’re pulling the text from the page.

The way you’ve described it, it sounds like you’re just grabbing the raw text, which likely loses formatting and structure that might help the model.

If that’s the case, you might consider pulling the HTML source, then converting that to markdown if it’s not well structured or something like XML or JSON if it is, before passing it to the LLM for parsing.
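For example, something like this (a sketch using the third-party beautifulsoup4 and html2text packages, assuming those are acceptable dependencies):

```python
import html2text
from bs4 import BeautifulSoup

def page_to_markdown(raw_html: str) -> str:
    """Strip noisy elements from the HTML and convert the rest to markdown."""
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop elements that add tokens but no extractable facts.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    converter = html2text.HTML2Text()
    converter.ignore_links = False   # links sometimes carry company/investor names
    converter.body_width = 0         # don't hard-wrap lines
    return converter.handle(str(soup))
```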

As always, a minimally working example of the type of HTML page you’re working with and the target output would be invaluable here.

If you could provide that, I’m sure someone here would be able to help you increase the model accuracy by an order of magnitude.


The websites are generally press releases or news announcements which are likely to include information about companies which recently raised money.

From that information, the data entry person’s responsibility is to pull:

  • Amount raised
  • Series
  • Name of the company
  • Name of investor(s) (this is a list)
    • Related individuals from those investors
  • Important people related to the company which raised money (this is a list)
    • Name
    • Relationship
  • Date of raise

It is plausible for multiple values to be legitimately null, as the information might simply not be there on the webpage.

The JSON-schema-driven function calls occasionally hallucinate values when the correct value is null.

The model is fed the raw HTML along with the schema and a system prompt describing what needs to happen.
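For context, a heavily cut-down sketch of the shape of the schema is below (field names are placeholders, not my actual 900-token schema). One common mitigation for the hallucination issue is making null an explicit, legal type for every scalar field, and saying so in the system prompt, so "not present on the page" is a valid answer rather than something the model has to invent:

```python
# Hypothetical, simplified extraction schema with nullable scalar fields.
FUNDING_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["company_name", "amount_raised", "series",
                 "date_of_raise", "investors", "key_people"],
    "properties": {
        "company_name":  {"type": ["string", "null"]},
        "amount_raised": {"type": ["number", "null"]},
        "series":        {"type": ["string", "null"]},
        "date_of_raise": {"type": ["string", "null"]},
        "investors": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": False,
                "required": ["name", "related_individuals"],
                "properties": {
                    "name": {"type": ["string", "null"]},
                    "related_individuals": {"type": "array",
                                            "items": {"type": "string"}},
                },
            },
        },
        "key_people": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": False,
                "required": ["name", "relationship"],
                "properties": {
                    "name":         {"type": ["string", "null"]},
                    "relationship": {"type": ["string", "null"]},
                },
            },
        },
    },
}
```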

The scoring solution seems to be the best I can do for now.

The next problem is how to normalize what are otherwise unordered lists so that I can consistently match like to like. Asking GPT-4 suggests converting each list to a dict.
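Concretely, the kind of normalization I’m thinking about is keying each list entry on a normalized identifying field (a sketch; it assumes "name" is a reliable key, which it may not always be):

```python
def key_by(entries, key_field="name"):
    """Turn [{'name': 'Acme Ventures', ...}, ...] into a dict keyed by name."""
    normalized = {}
    for entry in entries or []:
        key = (entry.get(key_field) or "").strip().casefold()
        normalized[key] = entry
    return normalized

expected = [{"name": "Acme Ventures", "related_individuals": ["J. Doe"]}]
generated = [{"name": "ACME Ventures ", "related_individuals": ["J. Doe"]}]
exp, gen = key_by(expected), key_by(generated)
matched_keys = exp.keys() & gen.keys()   # then compare matched entries field by field
print(sorted(matched_keys))              # ['acme ventures']
```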