Best Way to Process 2500 large PDFs for Specific Data Extraction?

I’m working on a project to process 2,500 PDF files (~50 pages each). I need to extract around 42 specific fields per document (e.g., Company Name, plus various numbers and percentages) using a language model like GPT to answer these 42 questions.

  1. Custom Data Extraction: Each PDF has its own structure; the documents aren’t uniform, so I need to locate the text relevant to each question.
  2. Token Limits with GPT: After extraction, I’ll need to send the text to GPT for question answering, but I’m concerned about token limits given the document size.
  3. Automating the Workflow: Ideally, I’d like to set up a pipeline to handle uploads, extraction, and querying (a rough sketch of the extraction step is below).
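
For context, the extraction step I have in mind looks roughly like this. It’s a minimal sketch using pypdf; the directory name is a placeholder and any OCR or layout handling is left out:

```python
from pathlib import Path

from pypdf import PdfReader

PDF_DIR = Path("pdfs")  # placeholder folder holding the 2,500 files


def extract_text(pdf_path: Path) -> str:
    """Pull the raw text out of every page of one PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Map of filename -> full document text, built once up front.
documents = {p.name: extract_text(p) for p in sorted(PDF_DIR.glob("*.pdf"))}
```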

Questions:

  1. What’s the best way to accurately extract and structure this data? Is LangChain plus the OpenAI API the right combination? I tried segmenting the text and then prompting over the segments, but I lose some data when I split documents to stay under the token limit (a rough sketch of my current splitter is below).
  2. Any suggestions on handling token limits when sending large text chunks to GPT?
  3. Tips on setting up an automated, scalable workflow? I currently use LangChain.
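
For reference, this is roughly how I’m splitting documents today (the chunk sizes are just values I picked, not tuned):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Overlapping chunks so a value that straddles a boundary still shows up
# intact in at least one chunk; 2000/200 are guesses, not tuned numbers.
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)


def chunk_document(text: str) -> list[str]:
    return splitter.split_text(text)
```

Each chunk then gets sent to GPT together with the 42 questions, and that is where the data loss shows up.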

Thanks for any insights!


I don’t have an answer, just two queries.

  1. Have you asked GPT about its own methodology for ensuring consistently correct output?
  2. Without naming them, are there viable alternatives? I’m sure GPT or other models could summarise them.

TUT

This may be of use to you.

https://platform.openai.com/docs/guides/structured-outputs
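
A minimal sketch of how that might look for your 42 fields, assuming the current openai Python SDK and using a small Pydantic model as a stand-in for the real field list (the model name, field names, and `document_chunk` are illustrative):

```python
from openai import OpenAI
from pydantic import BaseModel


class ReportFields(BaseModel):
    # Stand-in for the actual 42 fields you would define here.
    company_name: str
    revenue: float | None
    growth_rate_pct: float | None


client = OpenAI()

document_chunk = "...text extracted from one PDF chunk..."  # placeholder

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # any model that supports structured outputs
    messages=[
        {"role": "system", "content": "Extract the requested fields from the document text."},
        {"role": "user", "content": document_chunk},
    ],
    response_format=ReportFields,
)

fields = completion.choices[0].message.parsed  # ReportFields instance, or None on refusal
```

The schema constrains the model’s output, so you get the named fields back directly instead of parsing free text, which should make validating 42 values per document much easier.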