Compare PDF files and show changes

Hello, is there a way to compare PDF files and show the changes? Specifically, I want to compare Safety Data Sheets, which are structured into 16 sections with subsections. The comparison will be used for regression testing of old and new versions to see what changed. It should be doable with Python and a connection to the OpenAI API. Any ideas on how to approach this to get the best outcome? The output can be in JSON format as well.

Yes, but a large language/vision model probably isn’t the right tool for this job.

It will be better, easier, and cheaper to simply extract the text from the PDFs and run them through diff.

You could use a language model though to convert the diff results into something more human-readable.


Is there a way to do that in Python, or do I have to pay for diff?

I would start by using difflib.
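
For reference, here is a minimal sketch of what a difflib-only comparison could look like. It uses only the standard library, so nothing has to be paid for. The helper name `line_diff` and the two sample strings are made up for illustration; in practice the inputs would be the text extracted from the two PDFs.

```python
import difflib


def line_diff(old_text: str, new_text: str) -> list[str]:
    """Return unified-diff lines describing how old_text became new_text."""
    # Strip each line so indentation differences from PDF extraction are ignored.
    old_lines = [line.strip() for line in old_text.splitlines()]
    new_lines = [line.strip() for line in new_text.splitlines()]
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="old",
            tofile="new",
            lineterm="",  # input lines carry no trailing newline, so add none
            n=1,          # one line of unchanged context around each change
        )
    )


# Made-up sample text standing in for two extracted SDS versions.
old = "Shelf Life: Indefinite.\nStore in a cool place."
new = "Shelf life: Indefinite, if stored properly.\nStore in a cool place."

result = line_diff(old, new)
print("\n".join(result))
```

Lines starting with `-` come from the old version, lines starting with `+` from the new one, and lines starting with a space are unchanged context.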

Do you have a couple of example PDFs?

Edit: I went ahead and grabbed some MSDS files and whipped up a simple example.

from openai import OpenAI
from diff_match_patch import diff_match_patch
import subprocess

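# Extract plain text from each PDF (requires the pdftotext command-line tool).
# -enc UTF-8 makes the output encoding explicit; check=True raises if extraction fails.
subprocess.run(['pdftotext', '-enc', 'UTF-8', 'lysol_dec_2011.pdf', 'lysol_dec_2011.txt'], check=True)
subprocess.run(['pdftotext', '-enc', 'UTF-8', 'lysol_jul_2010.pdf', 'lysol_jul_2010.txt'], check=True)

# Read both versions as UTF-8 rather than relying on the platform default encoding.
with open('lysol_jul_2010.txt', 'r', encoding='utf-8') as file1, open('lysol_dec_2011.txt', 'r', encoding='utf-8') as file2:
    text1 = file1.read()
    text2 = file2.read()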

dmp = diff_match_patch()
# Strip surrounding whitespace so layout differences do not show up as changes.
text1 = '\n'.join(line.strip() for line in text1.split('\n'))
text2 = '\n'.join(line.strip() for line in text2.split('\n'))

# Line-mode diff: map each unique line to one character, diff, then map back.
text1_lines, text2_lines, line_array = dmp.diff_linesToChars(text1, text2)
diff = dmp.diff_main(text1_lines, text2_lines)
dmp.diff_charsToLines(diff, line_array)
dmp.diff_cleanupSemantic(diff)

# Render the diff as text: '- ' for removed lines, '+ ' for added, '  ' for unchanged.
diff_lines = []
for (op, data) in diff:
    if op == dmp.DIFF_DELETE:
        diff_lines.extend('- ' + line for line in data.split('\n') if line)
    elif op == dmp.DIFF_INSERT:
        diff_lines.extend('+ ' + line for line in data.split('\n') if line)
    else:
        # Blank lines set unchanged context apart from the changes around it.
        diff_lines.append('')
        diff_lines.extend('  ' + line for line in data.split('\n') if line)
        diff_lines.append('')
diff_text = '\n'.join(diff_lines)

client = OpenAI()
response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful document diff analyzer.\n\nUsers will provide you with a diff report documenting the differences between two versions of a document.\n\nYou will explain, section-by-section the changes made between the initial and subsequent versions of the file."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": diff_text
        }
      ]
    }
  ],
  temperature=0
)

resp = response.choices[0]
print(resp.message.content)

And the generated output is below:


Here is a section-by-section explanation of the changes made between the initial and subsequent versions of the Material Safety Data Sheet (MSDS) for Lysol:

General Information

  • Revision Date:
    • Initial Version: July 15, 2010
    • Subsequent Version: December 14, 2011

SECTION 2 — COMPOSITION, INFORMATION ON INGREDIENTS

  • Initial Version:
    • “pine oil (8002-09-3) 9%, n-alkyl dimethyl benzyl ammonium chloride (68424-85-1) 1.3% and water (7732-18-5) >89%.”
  • Subsequent Version:
    • “Pine oil (8002-09-3) 9%, n-Alkyl dimethyl benzyl ammonium chloride (68424-85-1) 1.3% and Water (7732-18-5) >89%.”
    • Change: Capitalization of the chemical names “Pine oil,” “n-Alkyl dimethyl benzyl ammonium chloride,” and “Water.”

SECTION 3 — HAZARDS IDENTIFICATION

  • Initial Version:
    • “Brown, oily liquid with a distinctive phenol odor.”
  • Subsequent Version:
    • “Brown, oily liquid. Distinctive phenol odor.”
    • Change: The sentence was split into two separate sentences for clarity.

SECTION 4 — FIRST AID MEASURES

  • Initial Version:
    • “Eye: Immediately flush with fresh water for at least 15 minutes.”
    • “External: Wash continuously with mild soap and fresh water for at least 15 minutes.”
    • “Internal: Give large quantities of water. Call a physician or poison control at once.”
  • Subsequent Version:
    • “Eye or External: Immediately flush with fresh water for at least 15 minutes.”
    • “Internal: Rinse mouth. Give large quantities of water for dilution. Call a physician or poison control at once.”
    • Change: Combined “Eye” and “External” into one instruction. Added “Rinse mouth” to the “Internal” instruction for clarity.

SECTION 10 — STABILITY AND REACTIVITY

  • Initial Version:
    • “Shelf Life: Indefinite.”
  • Subsequent Version:
    • “Shelf life: Indefinite, if stored properly.”
    • Change: Added the condition “if stored properly” to the shelf life information.

SECTION 13 — DISPOSAL CONSIDERATIONS

  • Initial Version:
    • “Please consult with state and local regulations.”
  • Subsequent Version:
    • “Please review all federal, state and local regulations that may apply, before proceeding.”
    • Change: Expanded the instruction to include federal regulations and added a directive to review all applicable regulations before proceeding.

SECTION 14 — TRANSPORT INFORMATION

  • Initial Version:
    • “Shipping Name: Not regulated”
    • “Hazard Class: N/A”
    • “UN Number: N/A”
  • Subsequent Version:
    • “Shipping name: Not regulated”
    • “Hazard class: N/A”
    • “UN number: N/A”
    • Change: Minor formatting change, specifically the capitalization of “Shipping name,” “Hazard class,” and “UN number.”

COPYRIGHT INFORMATION

  • Initial Version:
    • “©Flinn Scientific, Inc. All Rights Reserved.”
  • Subsequent Version:
    • “© 2011 Flinn Scientific, Inc. All Rights Reserved.”
    • Change: Updated the copyright year to 2011.

These changes reflect updates in formatting, additional clarifications, and expanded instructions to ensure better understanding and compliance with safety and regulatory requirements.


The example PDFs are the MSDS files for Lysol dated July 2010 and December 2011.

This actually looks really good. Do you think it is also checking subsections? SDSs have subsections like 1.1, 2.2, etc. Thanks for this, I have to try it!
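
One way to make sure subsections are compared explicitly, rather than hoping the model picks them up, would be to split the extracted text on headings first and diff each part on its own. The sketch below is only a rough idea under stated assumptions: the `HEADING` pattern assumes headings look like "SECTION 4 ..." or start with a subsection number such as "4.1", which may not match every SDS layout (a data line that happens to start with "15.5 " would be mistaken for a heading). The function names and sample text are made up. It also produces JSON, which was mentioned as an acceptable output format.

```python
import difflib
import json
import re

# Assumed heading shapes: "SECTION 4 ..." or a numbered subsection like "4.1 ...".
HEADING = re.compile(r"^(SECTION\s+\d+|\d{1,2}\.\d{1,2})(?=\s|:|$)", re.IGNORECASE)


def split_sections(text: str) -> dict[str, list[str]]:
    """Group non-empty lines under the most recent heading seen."""
    sections: dict[str, list[str]] = {"preamble": []}
    current = "preamble"  # holds anything that appears before the first heading
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            # Normalise the key, e.g. "Section  4" -> "SECTION 4".
            current = re.sub(r"\s+", " ", match.group(1).upper())
            sections.setdefault(current, [])
        sections[current].append(line)
    return sections


def diff_sections(old_text: str, new_text: str) -> dict[str, dict[str, list[str]]]:
    """Return removed and added lines for every section that changed."""
    old, new = split_sections(old_text), split_sections(new_text)
    keys = list(old) + [key for key in new if key not in old]
    report: dict[str, dict[str, list[str]]] = {}
    for key in keys:
        a, b = old.get(key, []), new.get(key, [])
        if a == b:
            continue
        removed: list[str] = []
        added: list[str] = []
        matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                removed.extend(a[i1:i2])
                added.extend(b[j1:j2])
        report[key] = {"removed": removed, "added": added}
    return report


# Made-up sample text standing in for two extracted SDS versions.
old = (
    "SECTION 10: STABILITY AND REACTIVITY\nShelf Life: Indefinite.\n"
    "SECTION 13: DISPOSAL CONSIDERATIONS\nPlease consult with state and local regulations."
)
new = (
    "SECTION 10: STABILITY AND REACTIVITY\nShelf life: Indefinite, if stored properly.\n"
    "SECTION 13: DISPOSAL CONSIDERATIONS\nPlease consult with state and local regulations."
)

report = diff_sections(old, new)
print(json.dumps(report, indent=2))
```

Only sections that actually changed end up in the report, so unchanged sections never need to be sent to the model at all.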

No idea.

I just grabbed a couple of random MSDS files and processed them.

I have just tested it and it seems to work well. It will be somewhat expensive because a lot of tokens are sent, but it is what it is. I had an encoding issue and had to switch to UTF-8 throughout to make it run.

Are there any suggestions for large files? My documents are 70+ pages long. Maybe split them into chunks?
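
Splitting into chunks could look something like the sketch below. It assumes the diff text has already been cut into pieces, for example one string per SDS section, and greedily packs those pieces into batches under a size budget so each batch can be sent as its own request. The name `batch_chunks` and the `max_chars` default are made up, and a character count is only a rough stand-in for a token count.

```python
def batch_chunks(chunks: list[str], max_chars: int = 12000) -> list[str]:
    """Greedily pack chunks into batches of at most max_chars characters.

    A single chunk longer than max_chars is kept whole in a batch of its own.
    """
    separator = "\n\n"
    batches: list[str] = []
    current: list[str] = []
    size = 0
    for chunk in chunks:
        # Account for the separator only when the batch already has content.
        extra = len(chunk) + (len(separator) if current else 0)
        if current and size + extra > max_chars:
            # The batch is full: close it and start a new one with this chunk.
            batches.append(separator.join(current))
            current, size = [], 0
            extra = len(chunk)
        current.append(chunk)
        size += extra
    if current:
        batches.append(separator.join(current))
    return batches


# Three made-up 40-character chunks with a 100-character budget.
demo = batch_chunks(["a" * 40, "b" * 40, "c" * 40], max_chars=100)
print(len(demo))
```

Keeping whole sections together in a batch means the model always sees a change with its surrounding context, which a fixed-size split in the middle of a section would not guarantee.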