You simply need to define a score function.
If we assume that every web page processed contains the same set of elements, so the JSON objects all share an identical schema, a very simple score function might award one point for every incorrect field. The comparison is then just the machine and human averages and, as in golf, the lower score wins.
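As a minimal sketch of that idea (the field names in the example are placeholders, not your actual schema):

```python
def score(extracted: dict, truth: dict) -> int:
    """One penalty point for each field whose extracted value differs from ground truth."""
    return sum(1 for field in truth if extracted.get(field) != truth[field])

def average_score(pairs) -> float:
    """Mean score over (extracted, truth) pairs; lower is better, as in golf."""
    return sum(score(e, t) for e, t in pairs) / len(pairs)
```

Run the same `average_score` over the machine's output and the human's output against the same ground truth, and compare the two numbers directly.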
If different pages have different numbers or types of fields, you'd just modify the score to be some kind of weighted average.
Or maybe some fields are more important than others, so you’d weight those more heavily, etc.
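One way to sketch both ideas at once — the weights below are purely illustrative, not a recommendation for your data:

```python
def weighted_score(extracted: dict, truth: dict, weights: dict) -> float:
    """Sum the weight of each incorrect field; fields not listed default to weight 1."""
    return sum(weights.get(f, 1.0) for f in truth if extracted.get(f) != truth[f])

def normalized_score(extracted: dict, truth: dict, weights: dict) -> float:
    """Divide by the total possible weight so pages with different
    field sets produce comparable scores in the range [0, 1]."""
    total = sum(weights.get(f, 1.0) for f in truth)
    return weighted_score(extracted, truth, weights) / total
```

Normalizing by the total possible weight is what lets you average scores across pages whose schemas differ.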
In terms of accuracy though, you might have better luck asking the model to locate data points individually, or first asking the model to generate a bullet-point list of all data in the page and then pulling the data of interest out in a second pass.
You can also have the model critically evaluate its own work and iterate on its answers.
Given the exceptionally low cost of 4o-mini, doing multiple passes through the text is feasible.
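A rough sketch of the two-pass-with-self-review idea. Here `call_model` is any function mapping a prompt string to a completion string — in practice a thin wrapper around whatever chat-completions client you're using (the wrapper and prompts are hypothetical, shown only to illustrate the flow):

```python
def extract_with_review(page_text: str, call_model) -> str:
    """Pass 1: draft extraction. Pass 2: ask the model to critique
    and correct its own draft against the original page."""
    draft = call_model(
        "List every data point on this page as bullet points, then "
        "return the requested fields as JSON.\n\n" + page_text
    )
    revised = call_model(
        "Here is a draft extraction of this page. Check it against the "
        "page, fix any errors, and return corrected JSON only.\n\n"
        "PAGE:\n" + page_text + "\n\nDRAFT:\n" + draft
    )
    return revised

# Stand-in model so the sketch runs without an API key:
fake_model = lambda prompt: '{"title": "Example"}'
extract_with_review("<p>Example</p>", fake_model)
```

Since each page goes through the model twice, the per-page cost roughly doubles — which is exactly why a cheap model makes this approach practical.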
The other thing I’d question is how, exactly, you’re pulling the text from the page.
The way you’ve described it, it sounds like you’re just grabbing the raw text, which likely loses formatting and structure that might help the model.
If that’s the case, you might consider pulling the HTML source and converting it to markdown if it’s not well structured, or to something like XML or JSON if it is, before passing it to the LLM for parsing.
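For real pages a dedicated library such as `markdownify` or `html2text` would be more robust, but here's a minimal stdlib-only sketch of the HTML-to-markdown idea, just to show that headings and list structure survive the conversion:

```python
from html.parser import HTMLParser

class MarkdownIsh(HTMLParser):
    """Very rough HTML -> markdown-ish converter: keeps heading levels
    and bullet items, drops everything else."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "p":
            self.out.append("\n")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(text)

def html_to_markdown(html: str) -> str:
    parser = MarkdownIsh()
    parser.feed(html)
    return "".join(parser.out).strip()
```

The point is that a `<ul>` of product specs becomes an explicit `- ` list the model can anchor on, rather than a run of undifferentiated text.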
As always, a minimally working example of the type of HTML page you’re working with and the target output would be invaluable here.
If you could provide that, I’m sure someone here would be able to help you increase the model accuracy by an order of magnitude.