I’m working on a project to process 2,500 PDF files (~50 pages each). I need to extract around 42 specific fields per document (e.g., Company Name, various numbers and percentages) by using a language model like GPT to answer those 42 questions against each file.
- Custom Data Extraction: Each PDF has its own structure; the documents are not unified, so I need to locate the text relevant to each question separately.
- Token Limits with GPT: After extraction, I need to send the text to GPT for question answering, but I’m concerned about hitting token limits given the document size.
- Automating the Workflow: Ideally, I’d like a pipeline that handles uploads, extraction, and querying end to end (a minimal sketch of what I have in mind follows this list).
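Here is a minimal sketch of the per-document flow I have in mind, using pypdf for text extraction and the OpenAI chat completions API for the question answering. The model name, the field list, and the prompt wording are placeholders, not final choices:

```python
# Minimal sketch of the per-document flow (pypdf + OpenAI chat completions).
# Model name, field list, and prompt wording are placeholders.
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FIELDS = ["Company Name", "Revenue", "Growth Rate (%)"]  # ~42 fields in the real run

def pdf_to_text(path: str) -> str:
    """Concatenate the extracted text of every page in the PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def ask_field(context: str, field: str) -> str:
    """Ask the model for a single field, given some extracted context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {
                "role": "system",
                "content": "Extract the requested field from the document text. "
                           "Answer with the value only, or 'NOT FOUND'.",
            },
            {"role": "user", "content": f"Field: {field}\n\nDocument text:\n{context}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

Looping `ask_field` over all 42 fields per document is where the token-limit concern comes in, since the full extracted text is often too large to send as context.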
Questions:
- What’s the best way to accurately extract and structure this data with LangChain and the OpenAI API? I tried splitting the text into segments and then prompting, but some data is lost when I segment documents to work around the token limit.
- Any suggestions on handling token limits when sending large text chunks to GPT? (My current chunk-and-retrieve attempt is sketched after these questions.)
- Tips on setting up an automated, scalable workflow? I currently use LangChain.
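For the token-limit question, this is the chunk-and-retrieve setup I’m experimenting with in LangChain (import paths can differ between LangChain versions, FAISS requires faiss-cpu to be installed, and the chunk sizes are rough guesses):

```python
# Sketch of the chunk-and-retrieve approach: split one document into overlapping
# chunks, index them, and send only the most relevant chunks per question.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

def build_index(full_text: str) -> FAISS:
    """Split one document into overlapping chunks and index them for retrieval."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500,    # characters, not tokens -- a rough guess
        chunk_overlap=200,  # overlap to reduce loss at chunk boundaries
    )
    chunks = splitter.split_text(full_text)
    return FAISS.from_texts(chunks, OpenAIEmbeddings())

def context_for_question(index: FAISS, question: str, k: int = 4) -> str:
    """Return only the k most relevant chunks instead of the whole document."""
    docs = index.similarity_search(question, k=k)
    return "\n\n".join(doc.page_content for doc in docs)
```

The overlap helps, but values that span two chunks still sometimes get dropped, which is the data loss I mentioned above.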
Thanks for any insights!