Evaluate/Verify Extracted Structured Data

I use GPT-4o to extract data from HTML tables into a JSON list of objects. The extraction is accurate most of the time, but for some input files the values end up shifted to the right or left because of empty cells used for formatting, alignment, or missing values. So far I have verified the output manually, but I would like to automate this. How can I verify the output, either programmatically or with an LLM? I am happy to make one API call to extract the data and a second one to verify it, but I doubt the model can give an accurate verification if it failed to extract the data accurately in the first place.
Is it better to verify programmatically or with an LLM? Are there other options?
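For reference, this is roughly the kind of programmatic check I have in mind. The field names and accepted value patterns here are just examples, not my real schema; the idea is to catch shifted values by type/pattern rather than by re-reading the table:

```python
import re

# Hypothetical schema: each extracted row should have these fields,
# and each value should pass the corresponding check.
EXPECTED_FIELDS = {
    "name": lambda v: isinstance(v, str) and bool(v.strip()),
    "quantity": lambda v: isinstance(v, (int, float)) or re.fullmatch(r"\d+", str(v)),
    "unit_price": lambda v: re.fullmatch(r"\$?\d+(\.\d{2})?", str(v)),
}

def validate_rows(rows):
    """Return a list of (row_index, field, value) for every suspicious cell."""
    problems = []
    for i, row in enumerate(rows):
        for field, check in EXPECTED_FIELDS.items():
            value = row.get(field)
            if value is None or not check(value):
                problems.append((i, field, value))
    return problems

if __name__ == "__main__":
    extracted = [
        {"name": "Widget", "quantity": 3, "unit_price": "$4.50"},
        {"name": "", "quantity": "Gadget", "unit_price": 2},  # a shifted row
    ]
    for idx, field, value in validate_rows(extracted):
        print(f"row {idx}: field '{field}' has suspicious value {value!r}")
```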


You need to code your web crawling to keep the "source" and the raw text, then use that metadata to catalogue each entry. Hope this helps.

If you already have the data as HTML tables, I would just transform them to JSON in plain old code, without making a call to an LLM. This would be cheaper and 100% correct every time. If writing the code is too hard, ask the LLM to write the code for you and tweak it until it works 100% correctly. To verify the LLM output, you’d probably end up writing similar code anyway - with the added problem that the few-percent chance of the LLM producing wrong JSON will never go away.
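Something along these lines would be a starting point. This is only a sketch: it assumes the first `<tr>` holds the headers, uses BeautifulSoup (pandas.read_html would also work), and deliberately raises on rows whose cell count doesn't match the header, so shifted or missing values surface instead of being silently accepted. Handling colspan/rowspan or empty formatting cells would take extra work:

```python
import json
from bs4 import BeautifulSoup

def table_to_json(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    headers = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        # Keep the cell count honest so misaligned rows are visible, not hidden.
        if len(cells) != len(headers):
            raise ValueError(f"row has {len(cells)} cells, expected {len(headers)}: {cells}")
        records.append(dict(zip(headers, cells)))
    return records

if __name__ == "__main__":
    html = """
    <table>
      <tr><th>Item</th><th>Qty</th><th>Price</th></tr>
      <tr><td>Widget</td><td>3</td><td>4.50</td></tr>
    </table>
    """
    print(json.dumps(table_to_json(html), indent=2))
```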
