Problem extracting data from PDF files and comparing them

thinktank · August 5, 2024, 7:39pm

In my experience, the ChatGPT UI is not appropriate for this type of task because the model they use is too creative, which is part of the reason you’re having trouble.

Another reason you’re having trouble is because you’re wrapping a bunch of different tasks all into one step.

I’m working on a similar problem. You can read more in this case study.

Unless there is a 1-to-1 relationship between the Data in each of the PDFs you’re trying to compare, you’re going to run into all of the difficulties you’re talking about. Instead, you need to “compare the content of PDF A to the content of PDF B.” This step requires intelligence and creativity, and that phrasing will stop it from trying to look for a 1-to-1 relationship as between similar spreadsheets or jsons.
Those PDFS you’re trying to analyze are massive. To extract all of that information at once is way outside of what a single ChatGPT UI prompt can handle. This step is not creatively challenging, but still requires intelligence and a lot of diligence.
When “extracting” you have to pay extra attention that the model isn’t summarizing, and be very explicit about it extracting literally everything. The reason it only summarizes is because of the tokens it takes to extract full documents. This part requires some supervision.

How to Proceed:

First, you need multiple models doing multiple specialized tasks. The extraction part is very important but tedious. I recommend trying GPT 4o Mini through the API or Playground; but, given the sizes of your PDFs you might need to use 4o turbo for it’s larger input. First, build yourself an Assistant that performs the Extraction. (You can turn down the model’s creativity (temperature) so it doesn’t add anything, which is a HUGE challenge when using the ChatGPT UI.)
The Extractor Model you create needs to extract the information between the two PDFs in such a way that a later model can perform analysis. This can be achieved in any number of ways, include standardization of the data through .json output recommended above. The challenge is figuring out how to structure the information coming out of the various PDFs. You can create any number of Extractor Models specialized to pull data from bids from Company A and another that extracts bids from Company B to help with this task. This will help further reduce hallucinations and structure output in a comparable format for later steps.
Then you need an Analyzation Model. A smart and creative model (i.e. 4o with temperature =1) that takes the extracted data, compares it, then makes decisions you are asking for.

Topic		Replies	Views
Poor quality response on trained LLM with pdf files Community gpt-4	29	6271	May 1, 2024
New 4-turbo model has a unique limit? Or is this a bizarre hallucation? API	18	4470	January 26, 2024
How to confirm that you got the correct value from a text other than repeating the same prompt over and over API	39	887	September 1, 2024
Prompt Fatigue Question For API Calls Prompting gpt-35-turbo	24	496	January 25, 2025
Search long pdf for specific table - possibly need fine tuning model API gpt-4 , fine-tuning , api	10	3068	March 29, 2024

Problem extracting data from PDF files and comparing them

Related topics