Extracting data from rent roll PDFs

For commercial properties, owners provide banks with a rent roll. These rent rolls are scanned and hand-written, and the tables come in all sorts of formats. I need to extract information from these rent roll PDFs into a standardised template, where I provide 6 to 8 headers and get the matching values from each PDF. I have a huge volume of these PDFs. What is the best solution?

Personally, I would convert the PDFs to Markdown (text) and then feed this to the LLM for extraction, using structured outputs to get JSON. Then feed the JSON to a traditional program for analysis, or to another LLM call if needed.
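To make the "structured outputs to get JSON" step a bit more concrete, here is a rough sketch of what the schema and the post-processing could look like. The field names in `FIELDS` are hypothetical stand-ins for your own 6 to 8 headers, and the `{"rows": [...]}` wrapper is just one possible shape. No LLM call is shown, since that depends on which API you use; the idea is that you pass something like `build_schema()` as the structured-output schema and run the reply through `parse_rows()`.

```python
import json
from typing import Dict, List, Optional

# Hypothetical template headers; swap in your own 6 to 8 fields.
FIELDS = ["tenant_name", "unit", "area_sqft", "lease_start", "lease_end", "monthly_rent"]
NUMERIC_FIELDS = {"area_sqft", "monthly_rent"}


def build_schema() -> dict:
    """Build a JSON Schema describing {"rows": [...]} with nullable fields."""
    props = {}
    for name in FIELDS:
        base = "number" if name in NUMERIC_FIELDS else "string"
        # Allow null so the model can say "not present" instead of guessing.
        props[name] = {"type": [base, "null"]}
    row_schema = {
        "type": "object",
        "properties": props,
        "required": FIELDS,
        "additionalProperties": False,
    }
    return {
        "type": "object",
        "properties": {"rows": {"type": "array", "items": row_schema}},
        "required": ["rows"],
        "additionalProperties": False,
    }


def to_number(value) -> Optional[float]:
    """Coerce values like "$2,500.00" to a float; return None if that fails."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = str(value).replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_rows(llm_json: str) -> List[Dict]:
    """Parse the model's JSON reply into rows that match the template."""
    data = json.loads(llm_json)
    rows = []
    for raw in data.get("rows", []):
        row = {name: raw.get(name) for name in FIELDS}
        for name in NUMERIC_FIELDS:
            row[name] = to_number(row[name])
        # Skip rows without a tenant name (for example, subtotal lines).
        if row["tenant_name"]:
            rows.append(row)
    return rows
```

Making every field nullable is a deliberate choice here: with messy scans, it is usually better for the model to return null than to invent a value, and the traditional program downstream can flag the gaps for review.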

For the first step, converting the PDFs to Markdown, I just rent a cheap cloud-based A100 to run the conversion, or you can try your luck with an API that does this. The rest is handled by the LLM API, plus whatever program you want in between or afterwards for residual processing.
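For a huge volume, the glue between the steps can be a small batch driver. The sketch below is only one way to wire it up: `run_batch`, `pdf_to_markdown` and `extract_rows` are hypothetical names, and the two callables are placeholders for whatever converter (GPU box or API) and LLM extraction call you end up using. It writes one CSV row per extracted rent roll line and collects failures instead of stopping the whole run.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, Iterable, List, TextIO


def run_batch(
    pdf_paths: Iterable[Path],
    pdf_to_markdown: Callable[[Path], str],        # placeholder: your PDF -> Markdown step
    extract_rows: Callable[[str], List[Dict]],     # placeholder: your LLM extraction step
    out: TextIO,
    headers: List[str],
) -> List[str]:
    """Run both steps over every PDF, write rows as CSV, return failure messages."""
    failed = []
    writer = csv.DictWriter(
        out,
        fieldnames=["source_file", *headers],
        extrasaction="ignore",  # drop keys that are not in the template
    )
    writer.writeheader()
    for path in pdf_paths:
        try:
            markdown = pdf_to_markdown(path)
            for row in extract_rows(markdown):
                # Keep the file name so every row can be traced back to its PDF.
                writer.writerow({"source_file": path.name, **row})
        except Exception as exc:
            # One bad scan should not stop the batch; record it for a retry.
            failed.append(f"{path}: {exc}")
    return failed
```

You would call it with something like `open("rent_rolls.csv", "w", newline="", encoding="utf-8")` as `out`, and re-run only the paths listed in the returned failures.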
