GPT not completing a document processing task as offered

I’m interacting with GPT-4o to get help with a document processing task. I have two quite large (700-page) PDFs, images only, with a lot of duplicated sub-sections. I want to merge them into one deduplicated PDF, with sub-documents kept intact (usually 2-5 pages each). I asked GPT-4o for a process outline and was given a sensible-looking pipeline involving image hashing, deduplication and so on. We agreed to skip OCR in the initial attempt, as it would be much slower.

The proposed method was set out as follows:

Before we proceed with processing Bundle 1 and Bundle 2, here’s a summary of the key steps and rules for deduplication and reorganization that we can agree upon:

Processing Steps:

  1. Extract pages from both PDF bundles:
  • We will extract all pages from both Bundle 1 and Bundle 2 to identify and compare sub-documents (groups of pages that belong together).
  2. Image-based comparison:
  • Since these are image-based PDFs, we will perform an image-based comparison to identify duplicate pages between Bundle 1 and Bundle 2.
  • Any pages that appear to be identical (or visually very similar) will be flagged as duplicates.
  3. Remove duplicates:
  • After identifying duplicate pages, we’ll remove them from the final output, keeping only one version of each sub-document.
  4. Reorganizing the pages:
  • The pages from each bundle will need to be reorganized to ensure the order of sub-documents is preserved.
  • Index pages (identifying the exhibits or collections) should stay with the pages they reference, even if those index pages have a different format from the main content. We’ll make sure these index pages are kept in place.
  5. Final output:
  • The resulting file will contain all unique sub-documents from both bundles, with duplicates removed, and pages will be in the correct order as per the original documents.

Rules for Reorganizing Pages:

  • Index pages (such as those that reference exhibits or groups of sub-documents) must stay with the pages they refer to. If an index page refers to a set of pages from either bundle, that set of pages should immediately follow the index.
  • Sub-documents should appear in the order in which they were originally presented across both bundles.
  • If any content is missing from Bundle 2 (e.g., due to formatting or omissions), it should be flagged for potential manual review.
  • Sub-document matching will be based on visual similarity, and any pages that are deemed to be identical (either by image or visual similarity) will be considered duplicates.

Next Steps:

Once we confirm the plan, I will begin the processing with the two bundles. If there are any specific rules you’d like to adjust, or if you’d like to include any special handling for particular pages or exhibits, please let me know!

As I don’t know the internal capabilities of the backend, I accepted the proposed method (which was refined over several rounds of prompting). However, after uploading the PDFs I hit endless delays. Each time I prompted for an update, I got a “politician’s answer”: for one reason or another the task had not actually been started, followed by yet another assurance that it would now begin. After repeating this several times over several hours (so plenty of actual processing time), I gave up. It seems the backend tools it suggested were simply not able to do the task.
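For what it’s worth, the kind of pipeline it described doesn’t actually need the chat backend at all; it can be run locally with standard tooling. Here’s a minimal sketch, assuming the `pdf2image`, `imagehash` and `pypdf` packages; the file names, DPI and threshold are placeholders, it only deduplicates at page level, and it is not what the backend actually ran:

```python
# Sketch only: render pages to images, perceptual-hash them, then keep
# every page of Bundle 1 and append only those Bundle 2 pages whose hash
# doesn't match anything already kept.
from pdf2image import convert_from_path   # pip install pdf2image (needs poppler)
import imagehash                           # pip install imagehash
from pypdf import PdfReader, PdfWriter     # pip install pypdf

DUP_THRESHOLD = 5   # max Hamming distance at which two pages count as duplicates

def page_hashes(pdf_path, dpi=100):
    """Render each page to an image and return its perceptual hash."""
    return [imagehash.phash(img) for img in convert_from_path(pdf_path, dpi=dpi)]

def merge_dedup(bundle1, bundle2, out_path):
    seen = page_hashes(bundle1)
    writer = PdfWriter()

    # Bundle 1 is kept in full, in its original order.
    for page in PdfReader(bundle1).pages:
        writer.add_page(page)

    # Bundle 2 pages are appended only if no kept page looks the same.
    for page, h in zip(PdfReader(bundle2).pages, page_hashes(bundle2)):
        if all(h - s > DUP_THRESHOLD for s in seen):
            writer.add_page(page)
            seen.append(h)

    with open(out_path, "wb") as f:
        writer.write(f)

merge_dedup("bundle1.pdf", "bundle2.pdf", "merged_deduplicated.pdf")
```

For 700-page bundles you would want to render pages in batches (`convert_from_path` accepts `first_page`/`last_page`), and keeping 2-5 page sub-documents intact with index pages next to their exhibits would need grouping logic on top of this sketch.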

What is the “fix” for GPT giving untrue answers about what a proposed task pipeline is doing, instead of reporting accurately on what the system itself is actually doing?
If any devs are here and can look at the interaction, the chat ID was 67d03884-65e4-8012-8a2c-7cf18fb32306

any comments welcome.
thanks,
Miles

Hey,

I had the same problem and was able to solve it with the OpenAI API. From your post I can see that you used the web interface; the issue there is that the web interface does not cope well with very large PDFs and starts to hallucinate. Try the OpenAI API instead: chunk your PDFs and feed the chunks to the API.
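Roughly what that can look like, assuming the `openai` and `pdf2image` Python packages (the chunk size, model name, prompt and file names below are only placeholders):

```python
# Sketch of the chunked-API approach: render a few pages at a time and ask a
# vision-capable model to describe them, so the per-chunk output can later be
# compared across the two bundles.
import base64, io
from pdf2image import convert_from_path   # pip install pdf2image (needs poppler)
from openai import OpenAI                  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def page_to_data_url(img):
    """Encode a rendered page as a base64 data URL for the API."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def describe_chunk(pdf_path, first_page, last_page):
    """Send one small chunk of pages to the API and return the model's summary."""
    pages = convert_from_path(pdf_path, dpi=100,
                              first_page=first_page, last_page=last_page)
    content = [{"type": "text",
                "text": "Briefly describe each of these scanned pages so they "
                        "can be matched against pages from another bundle."}]
    content += [{"type": "image_url",
                 "image_url": {"url": page_to_data_url(p)}} for p in pages]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

# e.g. process bundle 1 five pages at a time
print(describe_chunk("bundle1.pdf", first_page=1, last_page=5))
```

The point is simply that each request stays small enough for the model to handle; the comparison and merging across bundles still has to happen in your own code.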