Best Way to Process 2500 large PDFs for Specific Data Extraction?

I’m working on a project to process 2,500 PDF files (~50 pages each). I need to extract around 42 specific fields per document (e.g., Company Name, plus various numbers and percentages) using a language model like GPT to answer these 42 questions.

  1. Custom Data Extraction: Each PDF has its own structure; the documents aren’t uniform, so I need to locate the text relevant to each question.
  2. Token Limits with GPT: After extraction, I’ll need to send the text to GPT for question answering, but I’m concerned about token limits given the document size.
  3. Automating the Workflow: Ideally, I’d like to set up a pipeline to handle uploads, extraction, and querying (a rough sketch of the extraction step is below).
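
For context, the extraction step I have in mind looks roughly like this. It’s a minimal sketch using pypdf; the directory name is a placeholder and any OCR or layout handling is left out:

```python
from pathlib import Path

from pypdf import PdfReader

PDF_DIR = Path("pdfs")  # placeholder folder holding the 2,500 files


def extract_text(pdf_path: Path) -> str:
    """Pull the raw text out of every page of one PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Map of filename -> full document text, built once up front.
documents = {p.name: extract_text(p) for p in sorted(PDF_DIR.glob("*.pdf"))}
```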

Questions:

  1. What’s the best way to accurately extract and structure this data? Is LangChain plus the OpenAI API the right combination? I tried segmenting the text and then prompting over the segments, but I lose some data when I split documents to stay under the token limit (a rough sketch of my current splitter is below).
  2. Any suggestions on handling token limits when sending large text chunks to GPT?
  3. Tips on setting up an automated, scalable workflow? I currently use LangChain.
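
For reference, this is roughly how I’m splitting documents today (the chunk sizes are just values I picked, not tuned):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Overlapping chunks so a value that straddles a boundary still shows up
# intact in at least one chunk; 2000/200 are guesses, not tuned numbers.
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)


def chunk_document(text: str) -> list[str]:
    return splitter.split_text(text)
```

Each chunk then gets sent to GPT together with the 42 questions, and that is where the data loss shows up.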

Thanks for any insights!


I don’t have an answer, just two queries.

  1. Have you asked GPT about its own methodology for ensuring consistently correct output?
  2. Without naming them, are there viable alternatives? I’m sure GPT or other models could summarise them.

TUT

This may be of use to you.

https://platform.openai.com/docs/guides/structured-outputs
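
A minimal sketch of how that might look for your 42 fields, assuming the current openai Python SDK and using a small Pydantic model as a stand-in for the real field list (the model name, field names, and `document_chunk` are illustrative):

```python
from openai import OpenAI
from pydantic import BaseModel


class ReportFields(BaseModel):
    # Stand-in for the actual 42 fields you would define here.
    company_name: str
    revenue: float | None
    growth_rate_pct: float | None


client = OpenAI()

document_chunk = "...text extracted from one PDF chunk..."  # placeholder

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # any model that supports structured outputs
    messages=[
        {"role": "system", "content": "Extract the requested fields from the document text."},
        {"role": "user", "content": document_chunk},
    ],
    response_format=ReportFields,
)

fields = completion.choices[0].message.parsed  # ReportFields instance, or None on refusal
```

The schema constrains the model’s output, so you get the named fields back directly instead of parsing free text, which should make validating 42 values per document much easier.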