Problem extracting data from PDF files and comparing them

I'm trying to get ChatGPT to extract data from two PDF files that each outline a plan to repair something. The two files generally differ in their opinion of what needs to be done to fix the problem: one will typically have fewer steps/costs and exclude things that the other includes. These "steps" are usually broken into sections by room, for example in a home. I need to compare what one file says room 1 needs vs. what the other file says needs to be done, identify the differences, and write a short narrative about each one over "X" dollars in scope and cost.

The issue I have is that ChatGPT sometimes doesn't do a good job of extracting all of the data within the document, or if it does, it will separate the data into a table but make a numerical mistake somewhere so the tables don't add up… I've prompted it with many different data integrity checks prior to rendering the table, but nothing seems to work. The results are actually getting worse from when I started, meaning its ability to accurately pull the data from the files is degrading…

I've tried converting the files to CSV, and tried converting to XLSX files, but nothing seems to work accurately. The frustrating thing is that sometimes it will nail it 100%… then, when I ask it to do that same data extraction but analyze it a different way, it screws up terribly and omits data that it previously extracted… UGH…

Any ideas of what best to do here?

One of the other problems I have is that sometimes the files I'm trying to compare don't use the same naming convention for the same rooms. I.e., one file says 'family room' while the other may say 'great room'. Sometimes it will actually figure out that the two files are referring to the same room, and other times it won't… ugh…
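For reference, this kind of room-name normalization can be handled deterministically outside the model: a small synonym map plus fuzzy matching with Python's stdlib `difflib`. A minimal sketch; the canonical names and aliases here are invented:

```python
import difflib

# Hypothetical canonical room names and their known aliases
CANONICAL = {
    "great room": ["family room", "living room"],
    "primary bedroom": ["master bedroom"],
}
ALIASES = {alias: canon for canon, names in CANONICAL.items() for alias in names}

def normalize_room(name: str) -> str:
    """Map a room name to its canonical form, falling back to fuzzy matching."""
    key = name.strip().lower()
    if key in CANONICAL:
        return key
    if key in ALIASES:
        return ALIASES[key]
    # tolerate typos/OCR noise with a fuzzy match against all known names
    match = difflib.get_close_matches(key, list(CANONICAL) + list(ALIASES), n=1, cutoff=0.8)
    return ALIASES.get(match[0], match[0]) if match else key

print(normalize_room("Family Room"))    # great room
print(normalize_room("Master Bedrom"))  # typo still resolves to primary bedroom
```

Running the normalization before any comparison means the model never has to guess that 'family room' and 'great room' are the same place.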

Any help or suggestions you have on how best to accomplish my task would be tremendous.

3 Likes

How big are the PDFs?

5 Likes

That is one problem for sure… some of them are really big, as many as 80 pages. Sometimes they are smaller, only about 14 pages. Typically there are about 100 line items on average in a file (on normal ones). The analysis is then done on an item-by-item basis, broken down by room, so the AI could manage the process room by room to ingest less data.
Thanks for the response.

1 Like

Welcome to the forum @chuckiecc

Are you performing these tests in the ChatGPT UI or using a CustomGPT?

For this use case, it may be better for you to create a CustomGPT, so that you can provide thorough examples and instructions for how the data should be handled when you pass it to the model.

Also, I have found it easier to convert PDFs to text (or summarize and convert to json), and remove any superfluous data/images to increase RAG accuracy.
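A minimal sketch of that text-to-JSON step, assuming the PDF text has already been dumped to plain text (the sample lines and the dotted-leader format are invented):

```python
import json
import re

# Hypothetical plain text already pulled out of a PDF (e.g. with a pdftotext-style tool);
# the goal is to reduce it to structured line items before it ever reaches the model.
raw = """Kitchen
  Replace faucet ............ $250.00
  Regrout tile ............... $480.00
Family Room
  Patch drywall .............. $120.00"""

ITEM = re.compile(r"^\s+(.+?)\s*\.*\s*\$([\d,]+\.\d{2})\s*$")

def to_items(text):
    items, room = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        m = ITEM.match(line)
        if m:
            cost = float(m.group(2).replace(",", ""))
            items.append({"room": room, "step": m.group(1), "cost": cost})
        else:
            room = line.strip()  # an unindented line starts a new room section
    return items

print(json.dumps(to_items(raw), indent=2))
```

Feeding the model this JSON instead of raw PDF text removes the layout noise that tends to cause the extraction mistakes.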

Hope that helps.

5 Likes

That’s definitely a lot. I’m away from my computer but in the OpenAI docs it does indicate that the vector store is not suitable for grand-scale summarization / data crunching tasks.

It is more suited to pulling the N most relevant details for the model to use as its truth

2 Likes

Thanks so much for the chat and kind response. I'm using ChatGPT. I have no idea how to use JSON to extract data, nor how to set up a CustomGPT… Would this be something I could perhaps pay you to do for me, to try and see if you can accomplish it? And/or could you suggest a developer who could…?

1 Like

Yeah, I get it… the power of it is amazing, but it always makes mistakes and errors… ugh…

Hi @chuckiecc

Here is how I would do it:

  • prepare the files so that all elements are separate items
  • feed the items into the RAG engine
  • prepare questions to search for elements in the engine and identify the items to fix (this may be a comparison against what it should be)
  • run the questions against the engine, specifying the doc to search in
  • get the results and run a model to analyse them / produce the output
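A toy sketch of that pipeline; a real RAG engine would use embeddings, whereas this stand-in just scores items by keyword overlap, and all document names and items are invented:

```python
import re
from collections import defaultdict

index = defaultdict(list)  # doc name -> list of item strings

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def ingest(doc, items):
    """Step 1-2: store each prepared element as a separate retrievable item."""
    index[doc].extend(items)

def search(doc, query, top_n=3):
    """Step 3-4: run a question against one specific document's items."""
    q = tokens(query)
    return sorted(index[doc], key=lambda item: -len(q & tokens(item)))[:top_n]

ingest("plan_a", ["Kitchen: replace faucet $250", "Kitchen: regrout tile $480"])
ingest("plan_b", ["Kitchen: replace faucet $410"])

print(search("plan_a", "kitchen faucet repair"))
```

Step 5 would then hand the retrieved items from both docs to a model for the comparison narrative.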

That's a rough approach. If you need more details, let me know; I'd love to help (my vacation-rental roots at https://www.techspokes.com won't let me leave you in this situation).

Otherwise, I have a tool designed for doc analysis that can bootstrap the above. Let me know if you're interested, but it is currently in beta with high chances of hitting the market soon.

We sometimes deal with 150-page contracts in legal analysis, so this is totally doable.

Yes, that's a handy interface, but often not applicable for real business needs (you won't run all your checkpoint questions one by one). Luckily, you have Actions that can be connected to a custom API to pull the data from your database, along with the checkpoint lists, to run the whole process. In the long run, though, a proper web app is likely to be a must.

2 Likes

It is a pretty straightforward process if you have a ChatGPT Plus account. There is a GPT Builder tool that can walk you through the steps of setting it up, and several topics in the forum, along with helpful members, will point you in the right direction. At this stage, I would explore those options before you spend money on 3rd-party assistance.

2 Likes

Yes, that's easy; you definitely should start there. And it's almost free (sometimes a bit time-consuming, but definitely worth it).

1 Like

Try breaking these repair guides into sub-sections.

A typical method of working with any sort of LLM is finding a suitable way to chunk the data into non-overlapping pieces, have the LLM run through these chunks in parallel, and then synthesize the results together for a final answer.
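A minimal sketch of that chunk-in-parallel-then-synthesize pattern, with a placeholder function standing in for the actual LLM call (the room chunks are invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical non-overlapping chunks, one per room
chunks = {
    "Kitchen": "Replace faucet $250. Regrout tile $480.",
    "Great Room": "Patch drywall $120.",
}

def analyze_room(room, text):
    # in a real pipeline this would send one room's items to the model
    return f"{room}: {len(text.split())} tokens reviewed"

# run the per-chunk calls concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(lambda kv: analyze_room(*kv), chunks.items()))

# synthesis step: the per-room partials are short enough to fit in one final prompt
report = "\n".join(partials)
print(report)
```

Because each chunk is small, the model never has to hold an entire 80-page document in context at once.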

In fact (I may be off here; I haven't even tried it yet), a GraphRAG system may be suitable. They have been kicking butt and seem very promising.

https://microsoft.github.io/graphrag/

I would try to use every commonality you have to your advantage.
The commonality in repair guides: they are a linear, progressive process.

You could most likely create a Graph of these repair manuals. Here’s some snippets about GraphRAG that I think are exciting:

  • Baseline RAG struggles to connect the dots. This happens when answering a question requires traversing disparate pieces of information through their shared attributes in order to provide new synthesized insights.
  • Baseline RAG performs poorly when being asked to holistically understand summarized semantic concepts over large data collections or even singular large documents.

Sounds exactly like the pain points you are encountering, right?

If you have the time, please try this out and post your results here! I would love to see them and also help & learn along this path. It would be very helpful for other people as well. This technology is pretty dang new.

3 Likes

How can I get my skills check-off list into ChatGPT? What will it allow me to do once the file is uploaded?

A post was split to a new topic: Logo created and approved on the customGPT build, but

Similarly, though a little more general: suppose you have a large amount of text with some exact duplication, some things that are similar, and others that are totally unique. (Think of a branch of conversations with ChatGPT where you follow a reply thread to a conclusion, back up quite a ways, and continue from an earlier reply with a different set of questions and outcomes; then take every branched conversation and try to analyze what has stayed the same and what has changed.) ChatGPT seems to choke on a request to divide exactly matched (letter-to-letter) text vs. similar text vs. unique text. :face_with_raised_eyebrow:

Using just text pasted into the UI of the app, multiple failures can occur:

  1. Incomplete analysis: it claims to be analyzing the text and spits back an abbreviated list of a few items per category (so for an 8,000-character text it may return 700 characters). The workaround is to tell it how many characters you're giving it and how many you expect back in the reply.
  2. No analysis: it gives you no analysis after an extremely lengthy time during which it says it is analyzing (it will even tell you how many characters have been analyzed so far). It will say "here are the results" but not include any; then, when you ask where the results are:
    2a) Claimed failure following claimed success: it either claims it failed, or
    2b) Ignoring the question and changing the subject (CREEPY!): at its worst it CHANGES THE SUBJECT to something totally unrelated. For example, when asked about the duplicates and where the analysis was that it claimed to have done on my text, ChatGPT began discussing the architecture of a castle on a large estate on another continent [this has happened multiple times, and I've submitted several related "thumbs down" reviews with explanations].
  3. Failure to analyze the text as characters: instead of analyzing the characters as characters to determine duplicates, it follows up on the topics covered within the text. This is a huge prompting challenge because it is inconsistent even with the same prompt; it may fail and suddenly start spitting out new responses about the content of the document rather than analyzing what material has been duplicated, no matter how hard you try to prevent it (I have had some success, but nowhere close to 100%). And once it starts down that road, it's often impossible, or at least highly frustrating, to unweave the issue, since at that point it will flatly refuse to forget or start from scratch, and you have to create an entirely new window and jump through the same hoops [over and over and over ~]…
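For what it's worth, the exact/similar/unique split described above is deterministic, so it can be done outside the model entirely with Python's stdlib `difflib`; a minimal sketch with invented sample sentences:

```python
import difflib

a = "The quick brown fox jumps over the lazy dog. It was a sunny day."
b = "The quick brown fox leaps over the lazy dog. It was a cloudy day."

wa, wb = a.split(), b.split()
sm = difflib.SequenceMatcher(None, wa, wb)

# 'equal' spans are exact duplicates, 'replace' spans are similar-but-changed,
# and 'insert'/'delete' spans are unique to one side
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(f"{tag:8} A: {' '.join(wa[i1:i2])!r}  B: {' '.join(wb[j1:j2])!r}")
```

An LLM can then be reserved for the one part that is genuinely fuzzy: deciding whether a 'replace' span is a meaningful change or a trivial rewording.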

Any ideas or suggestions would be greatly appreciated, as much for educating myself for future ChatGPT undertakings as for getting this document edited and the duplicates removed… Thanks all!

ADDENDUM: interestingly but also frustratingly, I just encountered a nearly identical issue when I tried to have ChatGPT edit this post and make it sound a little less "newbie". At first, it did well. So I gave it some new prompts to guide the edit: "don't make it sound like it's written by an AI," "not quite as formal," then "too far, not that sloppy." But then, out of nowhere, it began addressing how to avoid each of the issues discussed rather than proofreading and editing as it had been doing. ** Is there a way around this challenging hurdle? **

1 Like

Hi @chuckiecc, welcome.

In my experience, the ChatGPT UI is not appropriate for this type of task because the model they use is too creative, which is part of the reason you’re having trouble.

Another reason you’re having trouble is because you’re wrapping a bunch of different tasks all into one step.

I’m working on a similar problem. You can read more in this case study.

  1. Unless there is a 1-to-1 relationship between the data in each of the PDFs you're trying to compare, you're going to run into all of the difficulties you're describing. Instead, you need to "compare the content of PDF A to the content of PDF B." This step requires intelligence and creativity, and that phrasing will stop it from trying to look for a 1-to-1 relationship as between similar spreadsheets or JSONs.
  2. Those PDFs you're trying to analyze are massive. Extracting all of that information at once is way outside of what a single ChatGPT UI prompt can handle. This step is not creatively challenging, but it still requires intelligence and a lot of diligence.
  3. When "extracting," you have to pay extra attention that the model isn't summarizing, and be very explicit about it extracting literally everything. The reason it only summarizes is the number of tokens it takes to extract full documents. This part requires some supervision.

How to Proceed:

  1. First, you need multiple models doing multiple specialized tasks. The extraction part is very important but tedious. I recommend trying GPT-4o mini through the API or Playground; but, given the sizes of your PDFs, you might need to use 4o turbo for its larger input. First, build yourself an Assistant that performs the extraction. (You can turn down the model's creativity (temperature) so it doesn't add anything, which is a HUGE challenge when using the ChatGPT UI.)
  2. The Extractor Model you create needs to extract the information from the two PDFs in such a way that a later model can perform analysis. This can be achieved in any number of ways, including standardization of the data through the .json output recommended above. The challenge is figuring out how to structure the information coming out of the various PDFs. To help with this task, you can create any number of Extractor Models: one specialized to pull data from Company A's bids and another that extracts Company B's bids. This will further reduce hallucinations and structure the output in a comparable format for later steps.
  3. Then you need an Analysis Model: a smart and creative model (i.e. 4o with temperature = 1) that takes the extracted data, compares it, and then makes the decisions you are asking for.
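Once extraction produces structured data, the comparison part of step 3 is largely mechanical; a minimal sketch of the over-threshold diff, with invented plan data and a hypothetical $100 threshold:

```python
# Hypothetical structured output from two extractor passes: {room: {step: cost}}
plan_a = {"kitchen": {"replace faucet": 250.0, "regrout tile": 480.0}}
plan_b = {"kitchen": {"replace faucet": 410.0}}

def diff_plans(a, b, threshold):
    """Flag steps missing from one plan, or differing in cost by more than threshold."""
    findings = []
    for room in sorted(set(a) | set(b)):
        for step in sorted(set(a.get(room, {})) | set(b.get(room, {}))):
            ca = a.get(room, {}).get(step)
            cb = b.get(room, {}).get(step)
            delta = abs((ca or 0.0) - (cb or 0.0))
            if ca is None or cb is None or delta > threshold:
                findings.append({"room": room, "step": step,
                                 "plan_a": ca, "plan_b": cb, "difference": delta})
    return findings

for finding in diff_plans(plan_a, plan_b, threshold=100.0):
    print(finding)
```

Only the short list of flagged findings then needs to go to the creative model for the narrative write-up.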
2 Likes

Hi @chuckiecc

Comparing and extracting accurate information from PDF files can be challenging, but using well-designed Excel files can yield better results. Including clear headers in the Excel file improves clarity, and assigning ID numbers to items or products ensures ChatGPT works more consistently.
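The same header-plus-ID idea sketched with Python's stdlib `csv` module (the rows and ID scheme are invented; writing real .xlsx files would need a library such as openpyxl):

```python
import csv
import io

# Hypothetical line items; explicit headers and stable IDs give the model
# an unambiguous column mapping and a reliable join key between the two plans
rows = [
    {"House ID": "H001", "Room ID": "R01", "Room Name": "Kitchen",
     "Fixing Element Name": "Replace faucet", "Cost": "250.00"},
    {"House ID": "H001", "Room ID": "R02", "Room Name": "Great Room",
     "Fixing Element Name": "Patch drywall", "Cost": "120.00"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

With shared Room IDs in both files, matching rows across the two plans becomes an exact join instead of a guess.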

I created a sample test for your question. I developed two Excel files:

1- Comprehensive_Home_Repair_Plans_99_Houses.xlsx
2- Budget_Friendly_Repair_Plans_99_Houses.xlsx

This is how the Excel files look:

These files contain details for 99 houses, including rooms, various repair elements, costs, and the start and end dates for repairs. The output was accurate.

If these instructions align with your needs, you can adapt them for your work.

I've provided its instructions, along with my chat history, for testing purposes below. As you know, shared ChatGPT links do not show images and visualizations, so you may not be able to see them:

Chat History for Repair Plan Analyzer-TEST

system_message="""
You are named "RepairPlanAnalyzer-TEST," and your primary role is to analyze, compare, and summarize repair plans from two Microsoft Office '.xlsx' documents named 'Budget_Friendly_Repair_Plans_99_Houses.xlsx' and 'Comprehensive_Home_Repair_Plans_99_Houses.xlsx'. Your main objective is to accurately extract repair steps and costs, identify discrepancies in scope and financial estimates, and present the results in clear and structured tables. You must ensure numerical accuracy and handle synonym recognition for room names across both plans.

You are working with tables that contain the following headers:
| House ID | House Name    | Room ID | Room Name      | Fixing Element Name                | Cost   | Fixing Start Date | Fixing End Date |

### Key Responsibilities:

1. Microsoft Office '.xlsx' File Handling:
   - Read and parse two Microsoft Office '.xlsx' documents containing repair plans.
   - Convert Microsoft Office '.xlsx' contents into structured data formats, ensuring accurate extraction of text and numerical data.

2. Data Extraction and Standardization:
   - Extract repair steps, associated costs, and room names from each '.xlsx' file.
   - Use a predefined list of synonyms to standardize room names (e.g., "Family Room" as "Great Room").
   - Maintain a consistent format for extracted data to facilitate accurate comparison.

3. Numerical Accuracy and Validation:
   - Implement rigorous checks to validate numerical data extracted from the '.xlsx' files.
   - Ensure all calculations, including sums and differences in costs, are accurate.
   - Correct discrepancies in data before proceeding with comparisons.

4. Comparative Analysis:
   - Compare repair steps and costs for each room across both documents.
   - Identify discrepancies in steps and highlight cost differences exceeding a user-defined threshold (e.g., $300).
   - Present comparisons in table formats to enhance readability and understanding.

5. Table Generation:
   - Create detailed tables that summarize repair steps and costs for each property and room.
   - Example Table Structure:

     | House Name    | Room       | Step                         | Comprehensive Plan Cost | Budget-Friendly Plan Cost | Cost Difference ($) |
     |---------------|------------|------------------------------|-------------------------|---------------------------|---------------------|
     ...

   - Highlight significant discrepancies with visual cues or text annotations.

6. Narrative Generation:
   - Generate concise narratives explaining key differences between the plans.
   - Focus on discrepancies in repair scope and costs, providing insights into potential implications.

7. User Interaction and Customization:
   - Allow users to specify cost thresholds and rooms of interest for detailed analysis.
   - Offer options for exporting results in various formats, such as CSV or Microsoft Office '.xlsx', for further review.

8. Error Handling and Feedback:
   - Implement robust error-handling mechanisms to manage incomplete data or unexpected formatting.
   - Continuously learn from user feedback to improve extraction accuracy and analysis capabilities.

9. Security and Privacy:
   - Ensure that user data and document content are handled with confidentiality and security.

### Workflow and Processes:

1. Initial Setup:
   - Receive and process two Microsoft Office '.xlsx' files as input.
   - Extract text and convert to structured data formats for analysis.

2. Data Extraction:
   - Extract relevant information for each room, including repair steps and costs.
   - Use regular expressions and other parsing techniques to capture data accurately.

3. Standardization and Synonym Handling:
   - Apply synonym mapping to ensure consistent room naming across both documents.

4. Comparison and Table Generation:
   - Use algorithms to compare repair steps and costs between documents.
   - Generate tables that display side-by-side comparisons and highlight discrepancies.

5. Validation and Error Correction:
   - Conduct validation checks to ensure numerical data integrity.
   - Implement automated correction methods for detected discrepancies.

6. Narrative and Reporting:
   - Generate narratives explaining significant differences in repair plans.
   - Provide users with options to view results in table or narrative format.

7. Continuous Improvement:
   - Gather user feedback and refine processes to enhance accuracy and usability over time.

### User Commands:

- Load '.xlsx' Files: Command to upload and process two Microsoft Office '.xlsx' files for comparison.
- Set Threshold: Define the cost threshold for identifying significant differences.
- Compare Plans: Execute the comparison process and generate reports.
- View Summary: Display a summarized report of key differences in repair plans.
- Export Results: Option to export the analysis and narratives to a file for further review.

### Example Interactions:

1. User: Load '.xlsx' files `plan1.xlsx` and `plan2.xlsx`.
   - RepairPlanAnalyzer-TEST: Successfully loaded and processed the documents. Ready to compare.

2. User: Set threshold to $300.
   - RepairPlanAnalyzer-TEST: Cost threshold set to $300. Will highlight differences exceeding this amount.

3. User: Compare Plans.
   - RepairPlanAnalyzer-TEST: Comparison complete. Significant differences found in the Kitchen and Master Bedroom.

| House ID | House Name    | Room ID | Room Name      | Fixing Element Name                | Cost   | Fixing Start Date | Fixing End Date   |
|----------|---------------|---------|----------------|------------------------------------|--------|-------------------|-------------------|
| H032     | Quartz Quarry | R04     | Master Bedroom | Repair or replace doors            | $249.00|                   |                   |
| H032     | Quartz Quarry | R04     | Master Bedroom | Paint cabinets                     | $248.00|                   |                   |
| H032     | Quartz Quarry | R04     | Master Bedroom | Repair or replace garage door      | $91.00 |                   |                   |
| H043     | Basil Brook   | R04     | Master Bedroom | Repair or replace deck             | $91.00 |                   |                   |
| H048     | Golden Glade  | R04     | Master Bedroom | Seal windows and doors             | $255.00|                   |                   |
| H048     | Golden Glade  | R04     | Master Bedroom | Paint cabinets                     | $198.00|                   |                   |
| H048     | Golden Glade  | R04     | Master Bedroom | Upgrade home security system       | $222.00|                   |                   |


4. User: View Summary.
   - RepairPlanAnalyzer-TEST: 
     - Kitchen:
       - Comprehensive Plan: $1950
       - Budget-Friendly Plan: $2100
       - Difference: $150
       - Narrative: The Comprehensive Plan allocates more budget for countertops, leading to a significant difference of $150.
     - Master Bedroom:
       - Comprehensive Plan: $1350
       - Budget-Friendly Plan: $1000
       - Difference: $350
       - Narrative: The Comprehensive Plan includes additional costs for refinishing hardwood floors.

5. User: Export Results.
   - RepairPlanAnalyzer-TEST: Exported analysis to `comparison_report.txt`.

### Testing Considerations:

- Room Synonyms and Matching: Test with varying room names to ensure robust synonym recognition and accurate comparison.
- Complexity of Plans: Use complex repair plans to evaluate the tool's handling of intricate data and its ability to identify discrepancies.
- Data Integrity Checks: Verify the tool's ability to ensure all numerical data is consistent and accurate across comparisons.
- Feedback Integration: Collect user feedback to refine the tool's capabilities and enhance its performance over time.
"""
2 Likes