Best way to convert payroll reports to JSON

Hey y’all,

I’m currently trying to convert large payroll reports to JSON, where each piece of data per person is just a key and a value, e.g. “gross pay: 1,000, name: Bob Marley” and so on. I have a pretty good OCR setup from Microsoft, but GPT-4 cannot seem to accurately produce the correct JSON (yes, I have it turned on) that I need. Any tips?

Why are you using OCR and GPT-4 to create the JSON?

Payroll reports follow a rules-based system and a consistent structure. You can use that to structure the JSON yourself.
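For what it's worth, here is a minimal sketch of what "structure the JSON yourself" could look like when the OCR text comes out as "Label: value" lines. The sample text, the assumption that a "Name" line starts each person, and the helper names (`parse_value`, `parse_report`) are all made up for illustration; a real report will likely need different rules.

```python
import json
import re

# Hypothetical OCR output; real payroll reports will look different.
SAMPLE = """\
Name: Bob Marley
Gross Pay: 1,000.00
Net Pay: 812.50

Name: Jane Doe
Gross Pay: 2,350.00
"""

# "Label: value" on one line, with optional surrounding whitespace.
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")
# Amounts such as 1,000.00 or 812.5 (thousands separators optional).
AMOUNT_RE = re.compile(r"^-?\d{1,3}(?:,\d{3})*(?:\.\d+)?$|^-?\d+(?:\.\d+)?$")


def parse_value(raw):
    """Turn amount-looking strings into floats; leave everything else as text."""
    if AMOUNT_RE.match(raw):
        return float(raw.replace(",", ""))
    return raw


def parse_report(text):
    """Group key/value lines into one dict per person (assumes 'Name' starts a record)."""
    people = []
    current = None
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and anything that is not "Label: value"
        key = match.group(1).lower().replace(" ", "_")
        if key == "name":
            current = {}
            people.append(current)
        if current is not None:
            current[key] = parse_value(match.group(2))
    return people


print(json.dumps(parse_report(SAMPLE), indent=2))
```

This only works when the layout is predictable, which is exactly the point being argued here; it is not a substitute for OCR on scanned or free-form documents.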

Nah, because the formats differ a lot. It can be regular payroll PDFs, Excel files, and so on, sometimes even just a text file someone drew up. I’m just genuinely surprised, since this seems like a task that LLMs should handle.

They can, but it just seems unnecessary.

I take in numerous receipts, invoices, and bank information using Google Document AI. They do follow rules, even if they are in different areas.

You can give the model the “fields” to search for and it will return the JSON structure.
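A rough sketch of the "give it the fields" idea, independent of any particular vendor: build a prompt from a field list, then check that whatever comes back really is a JSON array with exactly those keys. The field names, the prompt wording, and both function names are assumptions for illustration; the actual call to the model or Document AI is deliberately left out.

```python
import json

# Hypothetical field list; swap in whatever your reports actually contain.
FIELDS = ["name", "gross_pay", "net_pay"]


def build_prompt(fields, document_text):
    """Assemble an extraction prompt that names the exact keys to return."""
    field_list = ", ".join(fields)
    return (
        "Extract one JSON object per employee from the payroll text below. "
        f"Use exactly these keys: {field_list}. "
        "Use null for any value that is missing. "
        "Return a JSON array and nothing else.\n\n"
        f"{document_text}"
    )


def validate_reply(reply_text, fields):
    """Parse the model's reply and reject records whose keys differ from the request."""
    records = json.loads(reply_text)  # raises ValueError on invalid JSON
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    expected = set(fields)
    for index, record in enumerate(records):
        if not isinstance(record, dict) or set(record) != expected:
            raise ValueError(f"record {index} does not match the requested keys")
    return records


# Example with a hand-written reply standing in for the model's output.
reply = '[{"name": "Bob Marley", "gross_pay": 1000, "net_pay": null}]'
print(validate_reply(reply, FIELDS))
```

Validating the reply is what lets you retry or flag a page instead of silently storing malformed JSON.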

Curious, but what if we don’t know the exact labels beforehand? We just want each data point per person. Plus, the documents are pretty large. Some payroll reports will just have gross pay per person, and some will have 10 data points per person, like gross, net, taxes paid, commission, and so on.

Ideally you’d want a schema / structure so you can actually perform analysis on it.

If you can’t, though, you can “ask” the OCR to just gather everything. Notice how it is already in a JSON-like structure:

But there is a lot of noise if you decide to do this, which is probably what you’re struggling with. It helps TREMENDOUSLY to have a pre-defined schema.
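To make the noise point concrete, here is a small sketch of how a pre-defined schema can filter a "gather everything" result: each canonical field lists the label spellings it accepts, and anything that matches none of them is set aside. The schema contents, the raw pairs, and the function names are invented for the example.

```python
import re

# Hypothetical schema: canonical field -> label spellings seen in reports.
SCHEMA = {
    "name": {"name", "employee", "employee name"},
    "gross_pay": {"gross", "gross pay", "gross wages"},
    "net_pay": {"net", "net pay", "take home"},
}

# Reverse lookup: normalized label -> canonical field.
ALIAS_TO_FIELD = {
    alias: field for field, aliases in SCHEMA.items() for alias in aliases
}


def normalize_label(label):
    """Lowercase, drop punctuation and digits, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z]+", " ", label.lower()).split())


def apply_schema(raw_pairs):
    """Keep pairs whose label maps to a schema field; collect the rest as noise."""
    record = {}
    dropped = []
    for label, value in raw_pairs:
        field = ALIAS_TO_FIELD.get(normalize_label(label))
        if field is None:
            dropped.append(label)
        else:
            record[field] = value
    return record, dropped


# Made-up key/value pairs of the kind a "gather everything" OCR pass returns.
raw = [
    ("Employee Name", "Bob Marley"),
    ("GROSS PAY:", "1,000"),
    ("Page", "3 of 12"),
    ("Net", "812.50"),
]
record, dropped = apply_schema(raw)
print(record)
print(dropped)
```

Keeping the dropped labels around is useful: labels that show up there often are good candidates for new aliases or new schema fields.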

I suppose we could use LLMs to just get the schema beforehand.
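One way that two-pass idea might look, purely as a sketch: ask the model for the field names from a sample of the document first, then reuse that list for extraction. `ask_llm` is a hypothetical callable (prompt in, text out) that you would wire to whatever client you use; the stub below only exists so the example runs, and the 4,000-character sample size is an arbitrary assumption.

```python
import json


def infer_schema(document_text, ask_llm, sample_chars=4000):
    """First pass: ask the model which per-employee fields appear in a sample."""
    sample = document_text[:sample_chars]  # large reports: only send the beginning
    prompt = (
        "List the per-employee field labels that appear in this payroll text "
        "as a JSON array of snake_case strings, and nothing else.\n\n" + sample
    )
    fields = json.loads(ask_llm(prompt))
    if not isinstance(fields, list) or not all(isinstance(f, str) for f in fields):
        raise ValueError("expected a JSON array of strings")
    # Remove duplicates while keeping the order the model returned.
    return list(dict.fromkeys(fields))


def fake_llm(prompt):
    """Stand-in for a real model call; returns a canned reply."""
    return '["name", "gross_pay", "net_pay", "gross_pay"]'


schema = infer_schema("Name: Bob Marley\nGross Pay: 1,000\n", fake_llm)
print(schema)
```

The resulting list can then be passed to the extraction step as the fixed set of keys, so every page of the same report is held to one schema.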