Currently trying to convert large payroll reports to json where basically just each piece of data per person is just a key and value so “gross pay: 1,000, name: Bob marley” etc. etc… I have a pretty good ocr setup from microsoft but GPT4 cannot seem to accurately produce the correct JSON (yes i have it turned on) that i need. Any tips ?
nah because the formats differ a-lot, it can be regular payroll pdfs, excel files etc. etc. sometimes just even a text file someone drew up. Just genuinely surprised this seems like a task that LLM’s should handle.
Curious but what if we don’t know the exact labels before hand ? We just want each data point per person ? + the documents are pretty large. Some payroll reports will just have gross pay per person and some will have 10 data points per person, like gross, net, taxes paid, commission etc. etc.
But, there is a lot of noise if you decide to do this. WHich is probably what you’re struggling with. It helps TREMENDOUSLY to have a pre-defined schema