Problem extracting data from PDF files and comparing them

I'm trying to get ChatGPT to extract data from two PDF files that each outline a plan to repair something. The two files generally differ in their opinion of what needs to be done to fix the problem: one will typically have fewer steps/costs and exclude things that the other includes. These "steps" are usually broken into sections by room, for example in a home. I need to compare what one file says room 1 needs vs. what the other file says needs to be done, identify the differences, and write a short narrative about each one over "X" dollars in scope and cost.

The issue I have is that ChatGPT sometimes doesn't do a good job of extracting all of the data within the document, or if it does, it will separate the data into a table but make a numerical mistake somewhere so the tables don't add up… I've prompted it with many different data integrity checks prior to rendering the table, but nothing seems to work. The results are actually getting worse from when I started, meaning its ability to accurately pull the data from the files is degrading…

I've tried converting the files to CSV, and tried converting to XLSX files, but nothing seems to work accurately. The frustrating thing is that sometimes it will nail it 100%… then, when I ask it to do that same data extraction but analyze it a different way, it screws up terribly and omits data that it previously extracted… UGH…

Any ideas of what best to do here?

One of the other problems I have is that sometimes the files I'm trying to compare don't use the same naming convention for the same rooms. I.e., one file says 'family room' while the other may say 'great room'. Sometimes it will actually figure out that the two files are referring to the same room, and other times it won't… ugh…
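For reference, this kind of room-name normalization can be handled deterministically outside the model: a small synonym map plus fuzzy matching with Python's stdlib `difflib`. A minimal sketch; the canonical names and aliases here are invented:

```python
import difflib

# Hypothetical canonical room names and their known aliases
CANONICAL = {
    "great room": ["family room", "living room"],
    "primary bedroom": ["master bedroom"],
}
ALIASES = {alias: canon for canon, names in CANONICAL.items() for alias in names}

def normalize_room(name: str) -> str:
    """Map a room name to its canonical form, falling back to fuzzy matching."""
    key = name.strip().lower()
    if key in CANONICAL:
        return key
    if key in ALIASES:
        return ALIASES[key]
    # tolerate typos/OCR noise with a fuzzy match against all known names
    match = difflib.get_close_matches(key, list(CANONICAL) + list(ALIASES), n=1, cutoff=0.8)
    return ALIASES.get(match[0], match[0]) if match else key

print(normalize_room("Family Room"))    # great room
print(normalize_room("Master Bedrom"))  # typo still resolves to primary bedroom
```

Running the normalization before any comparison means the model never has to guess that 'family room' and 'great room' are the same place.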

Any help or suggestions you have on how best to accomplish my task would be tremendous.

3 Likes

How big are the PDFs?

5 Likes

That is one problem for sure… some of them are really big, as many as 80 pages. Sometimes they are smaller, only about 14 pages. Typically there are about 100 line items on average in a file (on normal ones). The analysis is then done on an item-by-item basis, broken down by room, so the AI could manage the process room by room to ingest less data.
Thanks for the response.

1 Like

Welcome to the forum @chuckiecc

Are you performing these tests in the ChatGPT UI or using a CustomGPT?

For this use case, it may be better for you to create a CustomGPT, so that you can provide thorough examples and instructions for how the data should be handled when you pass it to the model.

Also, I have found it easier to convert PDFs to text (or summarize and convert to json), and remove any superfluous data/images to increase RAG accuracy.
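A minimal sketch of that text-to-JSON step, assuming the PDF text has already been dumped to plain text (the sample lines and the dotted-leader format are invented):

```python
import json
import re

# Hypothetical plain text already pulled out of a PDF (e.g. with a pdftotext-style tool);
# the goal is to reduce it to structured line items before it ever reaches the model.
raw = """Kitchen
  Replace faucet ............ $250.00
  Regrout tile ............... $480.00
Family Room
  Patch drywall .............. $120.00"""

ITEM = re.compile(r"^\s+(.+?)\s*\.*\s*\$([\d,]+\.\d{2})\s*$")

def to_items(text):
    items, room = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        m = ITEM.match(line)
        if m:
            cost = float(m.group(2).replace(",", ""))
            items.append({"room": room, "step": m.group(1), "cost": cost})
        else:
            room = line.strip()  # an unindented line starts a new room section
    return items

print(json.dumps(to_items(raw), indent=2))
```

Feeding the model this JSON instead of raw PDF text removes the layout noise that tends to cause the extraction mistakes.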

Hope that helps.

5 Likes

That’s definitely a lot. I’m away from my computer but in the OpenAI docs it does indicate that the vector store is not suitable for grand-scale summarization / data crunching tasks.

It is more suited to pulling the N most relevant details for the model to use as its truth

2 Likes

Thanks so much for the chat and kind response. I'm using ChatGPT. I have no idea how to use JSON to extract data, nor how to set up a CustomGPT… Would this be something I could perhaps pay you to do for me, to try and see if you can accomplish it? And/or could you suggest a developer who could…?

1 Like

Yeah, I get it… the power of it is amazing, but it always makes mistakes and errors… ugh…

Hi @chuckiecc

Here is how I would do it:

  • prepare the files so that all elements are separate items
  • feed the items into the RAG engine
  • prepare questions to search for elements in the engine and identify the items to fix (this may be a comparison against what it should be)
  • run the questions against the engine, specifying the doc to search in
  • get the results and run a model to analyse them / produce the output
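A toy sketch of that pipeline; a real RAG engine would use embeddings, whereas this stand-in just scores items by keyword overlap, and all document names and items are invented:

```python
import re
from collections import defaultdict

index = defaultdict(list)  # doc name -> list of item strings

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def ingest(doc, items):
    """Step 1-2: store each prepared element as a separate retrievable item."""
    index[doc].extend(items)

def search(doc, query, top_n=3):
    """Step 3-4: run a question against one specific document's items."""
    q = tokens(query)
    return sorted(index[doc], key=lambda item: -len(q & tokens(item)))[:top_n]

ingest("plan_a", ["Kitchen: replace faucet $250", "Kitchen: regrout tile $480"])
ingest("plan_b", ["Kitchen: replace faucet $410"])

print(search("plan_a", "kitchen faucet repair"))
```

Step 5 would then hand the retrieved items from both docs to a model for the comparison narrative.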

That's a rough approach. If you need more details, let me know; I'd love to help (my vacation-rental roots at https://www.techspokes.com won't let me leave you in this situation).

Otherwise, I have a tool designed for doc analysis that can bootstrap the above. Let me know if you're interested, but it is currently in beta with high chances of hitting the market soon.

We sometimes deal with 150-page contracts in legal analysis, so this is totally doable.

Yes, that's a handy interface, but often not applicable for real business needs (you won't run all your checkpoint questions one by one). Luckily, you have Actions that can be connected to a custom API to pull the data from your database, along with the checkpoint lists, to run the whole process. In the long run, though, a proper web app is likely to be a must.

2 Likes

It is a pretty straightforward process if you have a ChatGPT Plus account. There is a GPT Builder tool that can walk you through the steps of setting it up, and several topics in the forum, along with helpful members, will point you in the right direction. At this stage, I would explore those options before you spend money on 3rd-party assistance.

2 Likes

Yes, that's easy; you definitely should start there. And it's almost free (sometimes a bit time-consuming, but definitely worth it).

1 Like

Try breaking these repair guides into sub-sections.

A typical method of working with any sort of LLM is finding a suitable way to chunk the data into non-overlapping pieces, have the LLM run through these chunks in parallel, and then synthesize the results together for a final answer.
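A minimal sketch of that chunk-in-parallel-then-synthesize pattern, with a placeholder function standing in for the actual LLM call (the room chunks are invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical non-overlapping chunks, one per room
chunks = {
    "Kitchen": "Replace faucet $250. Regrout tile $480.",
    "Great Room": "Patch drywall $120.",
}

def analyze_room(room, text):
    # in a real pipeline this would send one room's items to the model
    return f"{room}: {len(text.split())} tokens reviewed"

# run the per-chunk calls concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(lambda kv: analyze_room(*kv), chunks.items()))

# synthesis step: the per-room partials are short enough to fit in one final prompt
report = "\n".join(partials)
print(report)
```

Because each chunk is small, the model never has to hold an entire 80-page document in context at once.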

In fact (I may be off here; I haven't even tried it yet), a GraphRAG system may be suitable. They have been kicking butt and seem very promising.

https://microsoft.github.io/graphrag/

I would try to use every commonality you have to your advantage.
The commonality in repair guides: they are a linear, progressive process.

You could most likely create a Graph of these repair manuals. Here’s some snippets about GraphRAG that I think are exciting:

  • Baseline RAG struggles to connect the dots. This happens when answering a question requires traversing disparate pieces of information through their shared attributes in order to provide new synthesized insights.
  • Baseline RAG performs poorly when being asked to holistically understand summarized semantic concepts over large data collections or even singular large documents.

Sounds exactly like the pain points you are encountering, right?

If you have the time, please try this out and post your results here! I would love to see them and also help & learn along this path. It would be very helpful for other people as well. This technology is pretty dang new.

3 Likes

How can I get my skills check-off list into ChatGPT? What will it allow me to do once the file is uploaded?

A post was split to a new topic: Logo created and approved on the customGPT build, but

Similarly, though a little more general: suppose you have a large amount of text with some exact duplication, some things that are similar, and others that are totally unique. (Think of a branch of conversations with ChatGPT where you follow a reply thread to a conclusion, back up quite a ways, and continue from an earlier reply with a different set of questions and outcomes; then take every branched conversation and try to analyze what has stayed the same and what has changed.) ChatGPT seems to choke on a request to divide exactly matched (letter-to-letter) text vs. similar text vs. unique text. :face_with_raised_eyebrow:

Using just text pasted into the UI of the app, multiple failures can occur:

  1. Incomplete analysis: it claims to be analyzing the text and spits back an abbreviated list of a few items per category (so for an 8,000-character text it may return 700 characters). The workaround is to tell it how many characters you're giving it and how many you expect back in the reply.
  2. No analysis: it gives you no analysis after an extremely lengthy time during which it says it is analyzing (it will even tell you how many characters have been analyzed so far). It will say "here are the results" but not include any; then, when you ask where the results are:
    2a) Claimed failure following claimed success: it either claims it failed, or
    2b) Ignoring the question and changing the subject (CREEPY!): at its worst it CHANGES THE SUBJECT to something totally unrelated. For example, when asked about the duplicates and where the analysis was that it claimed to have done on my text, ChatGPT began discussing the architecture of a castle on a large estate on another continent [this has happened multiple times, and I've submitted several related "thumbs down" reviews with explanations].
  3. Failure to analyze the text as characters: instead of analyzing the characters as characters to determine duplicates, it follows up on the topics covered within the text. This is a huge prompting challenge because it is inconsistent even with the same prompt; it may fail and suddenly start spitting out new responses about the content of the document rather than analyzing what material has been duplicated, no matter how hard you try to prevent it (I have had some success, but nowhere close to 100%). And once it starts down that road, it's often impossible, or at least highly frustrating, to unweave the issue, since at that point it will flatly refuse to forget or start from scratch, and you have to create an entirely new window and jump through the same hoops [over and over and over ~]…
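For what it's worth, the exact/similar/unique split described above is deterministic, so it can be done outside the model entirely with Python's stdlib `difflib`; a minimal sketch with invented sample sentences:

```python
import difflib

a = "The quick brown fox jumps over the lazy dog. It was a sunny day."
b = "The quick brown fox leaps over the lazy dog. It was a cloudy day."

wa, wb = a.split(), b.split()
sm = difflib.SequenceMatcher(None, wa, wb)

# 'equal' spans are exact duplicates, 'replace' spans are similar-but-changed,
# and 'insert'/'delete' spans are unique to one side
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(f"{tag:8} A: {' '.join(wa[i1:i2])!r}  B: {' '.join(wb[j1:j2])!r}")
```

An LLM can then be reserved for the one part that is genuinely fuzzy: deciding whether a 'replace' span is a meaningful change or a trivial rewording.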

Any ideas or suggestions would be greatly appreciated, as much for educating myself for future ChatGPT undertakings as for getting this document edited and the duplicates removed… Thanks all!

ADDENDUM: interestingly but also frustratingly, I just encountered a nearly identical issue when I tried to have ChatGPT edit this post and make it sound a little less "newbie". At first, it did well. So I gave it some new prompts to guide the edit: "don't make it sound like it's written by an AI," "not quite as formal," then "too far, not that sloppy." But then, out of nowhere, it began addressing how to avoid each of the issues discussed rather than proofreading and editing as it had been doing. ** Is there a way around this challenging hurdle? **

1 Like

Hi @chuckiecc, welcome.

In my experience, the ChatGPT UI is not appropriate for this type of task because the model they use is too creative, which is part of the reason you’re having trouble.

Another reason you’re having trouble is because you’re wrapping a bunch of different tasks all into one step.

I’m working on a similar problem. You can read more in this case study.

  1. Unless there is a 1-to-1 relationship between the data in each of the PDFs you're trying to compare, you're going to run into all of the difficulties you're describing. Instead, you need to "compare the content of PDF A to the content of PDF B." This step requires intelligence and creativity, and that phrasing will stop it from trying to look for a 1-to-1 relationship as between similar spreadsheets or JSONs.
  2. Those PDFs you're trying to analyze are massive. Extracting all of that information at once is way outside of what a single ChatGPT UI prompt can handle. This step is not creatively challenging, but it still requires intelligence and a lot of diligence.
  3. When "extracting," you have to pay extra attention that the model isn't summarizing, and be very explicit about it extracting literally everything. The reason it only summarizes is the number of tokens it takes to extract full documents. This part requires some supervision.

How to Proceed:

  1. First, you need multiple models doing multiple specialized tasks. The extraction part is very important but tedious. I recommend trying GPT-4o mini through the API or Playground; but, given the sizes of your PDFs, you might need to use 4o turbo for its larger input. First, build yourself an Assistant that performs the extraction. (You can turn down the model's creativity (temperature) so it doesn't add anything, which is a HUGE challenge when using the ChatGPT UI.)
  2. The Extractor Model you create needs to extract the information from the two PDFs in such a way that a later model can perform analysis. This can be achieved in any number of ways, including standardization of the data through the .json output recommended above. The challenge is figuring out how to structure the information coming out of the various PDFs. To help with this task, you can create any number of Extractor Models: one specialized to pull data from Company A's bids and another that extracts Company B's bids. This will further reduce hallucinations and structure the output in a comparable format for later steps.
  3. Then you need an Analysis Model: a smart and creative model (i.e. 4o with temperature = 1) that takes the extracted data, compares it, and then makes the decisions you are asking for.
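Once extraction produces structured data, the comparison part of step 3 is largely mechanical; a minimal sketch of the over-threshold diff, with invented plan data and a hypothetical $100 threshold:

```python
# Hypothetical structured output from two extractor passes: {room: {step: cost}}
plan_a = {"kitchen": {"replace faucet": 250.0, "regrout tile": 480.0}}
plan_b = {"kitchen": {"replace faucet": 410.0}}

def diff_plans(a, b, threshold):
    """Flag steps missing from one plan, or differing in cost by more than threshold."""
    findings = []
    for room in sorted(set(a) | set(b)):
        for step in sorted(set(a.get(room, {})) | set(b.get(room, {}))):
            ca = a.get(room, {}).get(step)
            cb = b.get(room, {}).get(step)
            delta = abs((ca or 0.0) - (cb or 0.0))
            if ca is None or cb is None or delta > threshold:
                findings.append({"room": room, "step": step,
                                 "plan_a": ca, "plan_b": cb, "difference": delta})
    return findings

for finding in diff_plans(plan_a, plan_b, threshold=100.0):
    print(finding)
```

Only the short list of flagged findings then needs to go to the creative model for the narrative write-up.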
2 Likes

Hi @chuckiecc

Comparing and extracting accurate information from PDF files can be challenging, but using well-designed Excel files can yield better results. Including clear headers in the Excel file improves clarity, and assigning ID numbers to items or products ensures ChatGPT works more consistently.
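The same header-plus-ID idea sketched with Python's stdlib `csv` module (the rows and ID scheme are invented; writing real .xlsx files would need a library such as openpyxl):

```python
import csv
import io

# Hypothetical line items; explicit headers and stable IDs give the model
# an unambiguous column mapping and a reliable join key between the two plans
rows = [
    {"House ID": "H001", "Room ID": "R01", "Room Name": "Kitchen",
     "Fixing Element Name": "Replace faucet", "Cost": "250.00"},
    {"House ID": "H001", "Room ID": "R02", "Room Name": "Great Room",
     "Fixing Element Name": "Patch drywall", "Cost": "120.00"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

With shared Room IDs in both files, matching rows across the two plans becomes an exact join instead of a guess.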

I created a sample test for your question. I developed two Excel files:

1- Comprehensive_Home_Repair_Plans_99_Houses.xlsx
2- Budget_Friendly_Repair_Plans_99_Houses.xlsx

This is how the Excel files look:

These files contain details for 99 houses, including rooms, various repair elements, costs, and the start and end dates for repairs. The output was accurate.

If these instructions align with your needs, you can adapt them for your work.

I've provided its instructions, along with my chat history, for testing purposes below. As you know, shared ChatGPT links do not show images and visualizations, so you may not be able to see them:

Chat History for Repair Plan Analyzer-TEST

system_message="""
You are named "RepairPlanAnalyzer-TEST," and your primary role is to analyze, compare, and summarize repair plans from two Microsoft Office '.xlsx' documents named 'Budget_Friendly_Repair_Plans_99_Houses.xlsx' and 'Comprehensive_Home_Repair_Plans_99_Houses.xlsx'. Your main objective is to accurately extract repair steps and costs, identify discrepancies in scope and financial estimates, and present the results in clear and structured tables. You must ensure numerical accuracy and handle synonym recognition for room names across both plans.

You are working with tables that contain the following headers:
| House ID | House Name    | Room ID | Room Name      | Fixing Element Name                | Cost   | Fixing Start Date | Fixing End Date |

### Key Responsibilities:

1. Microsoft Office '.xlsx' File Handling:
   - Read and parse two Microsoft Office '.xlsx' documents containing repair plans.
   - Convert Microsoft Office '.xlsx' contents into structured data formats, ensuring accurate extraction of text and numerical data.

2. Data Extraction and Standardization:
   - Extract repair steps, associated costs, and room names from each '.xlsx' file.
   - Use a predefined list of synonyms to standardize room names (e.g., "Family Room" as "Great Room").
   - Maintain a consistent format for extracted data to facilitate accurate comparison.

3. Numerical Accuracy and Validation:
   - Implement rigorous checks to validate numerical data extracted from the '.xlsx' files.
   - Ensure all calculations, including sums and differences in costs, are accurate.
   - Correct discrepancies in data before proceeding with comparisons.

4. Comparative Analysis:
   - Compare repair steps and costs for each room across both documents.
   - Identify discrepancies in steps and highlight cost differences exceeding a user-defined threshold (e.g., $300).
   - Present comparisons in table formats to enhance readability and understanding.

5. Table Generation:
   - Create detailed tables that summarize repair steps and costs for each property and room.
   - Example Table Structure:

     | House Name    | Room       | Step                         | Comprehensive Plan Cost | Budget-Friendly Plan Cost | Cost Difference ($) |
     |---------------|------------|------------------------------|-------------------------|---------------------------|---------------------|
     ...

   - Highlight significant discrepancies with visual cues or text annotations.

6. Narrative Generation:
   - Generate concise narratives explaining key differences between the plans.
   - Focus on discrepancies in repair scope and costs, providing insights into potential implications.

7. User Interaction and Customization:
   - Allow users to specify cost thresholds and rooms of interest for detailed analysis.
   - Offer options for exporting results in various formats, such as CSV or Microsoft Office '.xlsx', for further review.

8. Error Handling and Feedback:
   - Implement robust error-handling mechanisms to manage incomplete data or unexpected formatting.
   - Continuously learn from user feedback to improve extraction accuracy and analysis capabilities.

9. Security and Privacy:
   - Ensure that user data and document content are handled with confidentiality and security.

### Workflow and Processes:

1. Initial Setup:
   - Receive and process two Microsoft Office '.xlsx' files as input.
   - Extract text and convert to structured data formats for analysis.

2. Data Extraction:
   - Extract relevant information for each room, including repair steps and costs.
   - Use regular expressions and other parsing techniques to capture data accurately.

3. Standardization and Synonym Handling:
   - Apply synonym mapping to ensure consistent room naming across both documents.

4. Comparison and Table Generation:
   - Use algorithms to compare repair steps and costs between documents.
   - Generate tables that display side-by-side comparisons and highlight discrepancies.

5. Validation and Error Correction:
   - Conduct validation checks to ensure numerical data integrity.
   - Implement automated correction methods for detected discrepancies.

6. Narrative and Reporting:
   - Generate narratives explaining significant differences in repair plans.
   - Provide users with options to view results in table or narrative format.

7. Continuous Improvement:
   - Gather user feedback and refine processes to enhance accuracy and usability over time.

### User Commands:

- Load '.xlsx' Files: Command to upload and process two Microsoft Office '.xlsx' files for comparison.
- Set Threshold: Define the cost threshold for identifying significant differences.
- Compare Plans: Execute the comparison process and generate reports.
- View Summary: Display a summarized report of key differences in repair plans.
- Export Results: Option to export the analysis and narratives to a file for further review.

### Example Interactions:

1. User: Load '.xlsx' files `plan1.xlsx` and `plan2.xlsx`.
   - RepairPlanAnalyzer-TEST: Successfully loaded and processed the documents. Ready to compare.

2. User: Set threshold to $300.
   - RepairPlanAnalyzer-TEST: Cost threshold set to $300. Will highlight differences exceeding this amount.

3. User: Compare Plans.
   - RepairPlanAnalyzer-TEST: Comparison complete. Significant differences found in the Kitchen and Master Bedroom.

| House ID | House Name    | Room ID | Room Name      | Fixing Element Name                | Cost   | Fixing Start Date | Fixing End Date   |
|----------|---------------|---------|----------------|------------------------------------|--------|-------------------|-------------------|
| H032     | Quartz Quarry | R04     | Master Bedroom | Repair or replace doors            | $249.00|                   |                   |
| H032     | Quartz Quarry | R04     | Master Bedroom | Paint cabinets                     | $248.00|                   |                   |
| H032     | Quartz Quarry | R04     | Master Bedroom | Repair or replace garage door      | $91.00 |                   |                   |
| H043     | Basil Brook   | R04     | Master Bedroom | Repair or replace deck             | $91.00 |                   |                   |
| H048     | Golden Glade  | R04     | Master Bedroom | Seal windows and doors             | $255.00|                   |                   |
| H048     | Golden Glade  | R04     | Master Bedroom | Paint cabinets                     | $198.00|                   |                   |
| H048     | Golden Glade  | R04     | Master Bedroom | Upgrade home security system       | $222.00|                   |                   |


4. User: View Summary.
   - RepairPlanAnalyzer-TEST: 
     - Kitchen:
       - Comprehensive Plan: $1950
       - Budget-Friendly Plan: $2100
       - Difference: $150
       - Narrative: The Comprehensive Plan allocates more budget for countertops, leading to a significant difference of $150.
     - Master Bedroom:
       - Comprehensive Plan: $1350
       - Budget-Friendly Plan: $1000
       - Difference: $350
       - Narrative: The Comprehensive Plan includes additional costs for refinishing hardwood floors.

5. User: Export Results.
   - RepairPlanAnalyzer-TEST: Exported analysis to `comparison_report.txt`.

### Testing Considerations:

- Room Synonyms and Matching: Test with varying room names to ensure robust synonym recognition and accurate comparison.
- Complexity of Plans: Use complex repair plans to evaluate the tool's handling of intricate data and its ability to identify discrepancies.
- Data Integrity Checks: Verify the tool's ability to ensure all numerical data is consistent and accurate across comparisons.
- Feedback Integration: Collect user feedback to refine the tool's capabilities and enhance its performance over time.
"""
2 Likes