Extracting Data From PDFs

Hey there!

I am experimenting with extracting the same information from multiple PDFs, given one or more variables. In this specific case, I am extracting financial information from different pitch decks. For example, the variable would be CY Revenue, and the model would extract the CY Revenue from each PDF.

I have lots of past PDFs I could train the model on, along with the variables and their correct answers. Does anyone have any recommendations on the best way to do this?


You need to step back and figure out what you want as a result of this.

Why do you want to fine-tune a model on random numbers? What is your end goal? Why do you think a Large Language Model is suitable for this task?

This seems like a pretty big job that requires some extensive knowledge in numerous domains. You’re looking at a variety of different tools built for their own purpose and then harmonized together.

At the very least you will likely need OCR to read and correctly parse the content. You would then need a different form of AI to make sense of the numbers, and after that you could consider using an LLM to somehow correlate the numbers with the semantics of the document.

Simply put: if you shovel a bunch of numbers into an LLM to try to find an underlying numerical pattern, you will have better chances throwing dice or using astrology.

If you wish to try it anyway, I’d start with a PDF parser. Don’t use a typical Python library that just extracts text. They aren’t reliable enough.

Here’s an open-source one purpose-built for RAG, which I think would also be usable for extracting key content:

You could also try to get away with Tabula for data tables in PDFs:


I feel your pain! It's a challenging task. In my experience we need AGI for that :slight_smile: at least to do it reliably. PDFs come in so many forms that, until we have smart OCR that can actually ‘read’ each slide in context, it will be a huge challenge. The revenue could be in a table (with a column per FY), it could be in lines of text, or it could just be ‘2022’ hanging on a nice slide with $200m floating under it.

Also curious: is this for ‘incoming’ pitch decks, or are you trying to process a huge stack of pitch decks to get stats from them? I have a lot of experience with ‘incoming’, and certainly have some thoughts to share about those.

If I understand correctly, this is for incoming pitch decks. For example, a new pitch deck would come in and, given the variables I want, the tool would return their values.
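
To make the "variable in, value out" idea concrete, here is a rough sketch of just the lookup step. It assumes the text of each slide has already been pulled out by a parser or OCR (that part is not shown), and it only handles the easy case where a dollar amount sits shortly after the label. The alias list, the 80-character window, and the function names are all made up for illustration; tables and oddly laid-out slides would need something smarter.

```python
import re

# Hypothetical aliases per variable; real decks will need a longer list.
ALIASES = {
    "CY Revenue": ["cy revenue", "current year revenue"],
}

# Multipliers for common scale suffixes.
SCALE = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "mm": 1e6, "million": 1e6,
    "b": 1e9, "bn": 1e9, "billion": 1e9,
}

# Matches amounts such as "$200m", "$1.5 billion", or "$3,400k".
AMOUNT = re.compile(
    r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*(k|m|mm|bn|b|thousand|million|billion)?\b",
    re.IGNORECASE,
)


def parse_amount(match):
    """Turn a regex match for a dollar amount into a plain float."""
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return number * SCALE.get(suffix, 1)


def find_candidates(pages, variable, window=80):
    """Return (page_number, amount) for every dollar amount that appears
    within `window` characters after one of the variable's aliases."""
    results = []
    for page_number, text in enumerate(pages, start=1):
        for alias in ALIASES.get(variable, [variable]):
            for hit in re.finditer(re.escape(alias), text, re.IGNORECASE):
                nearby = text[hit.end():hit.end() + window]
                for amount in AMOUNT.finditer(nearby):
                    results.append((page_number, parse_amount(amount)))
    return results


# Example: two slides of already-extracted text.
pages = ["Meet the team", "CY Revenue\n$200m\nGrowth 40%"]
candidates = find_candidates(pages, "CY Revenue")
```

This returns candidates rather than a single answer on purpose: with layouts this varied, a person (or a later step) would probably still need to pick the right one.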