Read documents (DOCX or PDF) from a folder, replace matching text in each matched document, and save it as a new version

jose123thadathil · February 21, 2025, 9:00pm

I have a use case.

Would it be possible to read the documents (DOCX or PDF) from a given folder, replace matching text in each matched document, and save it as a new version? I don’t want to write any code or anything. My question is, does OpenAI have any capabilities to do that with an API call?

Foxalabs · February 21, 2025, 9:26pm

No, that would require at least some code to achieve with the API.

jose123thadathil · February 21, 2025, 9:51pm

Thank you, Foxalabs, for the update. How much code do I need to write to achieve this? Can you give me some idea? I am new to this. Also, at what point would I call the OpenAI API? Thank you, I really appreciate it.

Foxalabs · February 21, 2025, 10:02pm

The PDF/DOCX-to-text bit is fairly trivial and can be done in a few lines of Python:

# importing required modules
from pypdf import PdfReader

# creating a pdf reader object
reader = PdfReader('example.pdf')

# printing number of pages in pdf file
print(len(reader.pages))

# getting a specific page from the pdf file
page = reader.pages[0]

# extracting text from page
text = page.extract_text()
print(text)

But creating the PDF again from that text is not so trivial. It’s quite simple technically, but getting it to look right is the tricky bit. A super simple example of text to PDF:

from fpdf import FPDF
 
 
# save FPDF() class into a 
# variable pdf
pdf = FPDF()
 
# Add a page
pdf.add_page()
 
# set style and size of font 
# that you want in the pdf
pdf.set_font("Arial", size = 15)
 
# create a cell
pdf.cell(200, 10, txt = "GeeksforGeeks", 
         ln = 1, align = 'C')
 
# add another cell
pdf.cell(200, 10, txt = "A Computer Science portal for geeks.",
         ln = 2, align = 'C')
 
# save the pdf with name .pdf
pdf.output("GFG.pdf")

You could position that output text in the right place with some experimentation, and that should do it.

(DOCX extraction example below. I’m sure there is a similar library to create DOCX files, but I’ve not used one before; ChatGPT would know.)

import pypandoc

# Example file:
docxFilename = 'somefile.docx'
# With outputfile set, the text is written to somefile.txt
# and the returned string is empty
output = pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")
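For the DOCX half of the original question (walk a folder, replace text, save a new version), here is a minimal sketch that uses only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The function names `replace_in_docx` and `process_folder` and the `_v2` filename suffix are hypothetical, chosen for illustration. It assumes the XML is UTF-8 and that the search text sits inside a single `<w:t>` run; Word often splits a phrase across several runs, and headers, footers, and footnotes live in other parts of the archive, so real documents may need more work.

```python
import re
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape

# Matches a text run <w:t ...>text</w:t>, but not <w:tab/>, <w:tbl> or <w:t/>.
W_T = re.compile(r"(<w:t(?:\s[^>]*)?(?<!/)>)(.*?)(</w:t>)", re.DOTALL)


def replace_in_docx(src: Path, dst: Path, old: str, new: str) -> int:
    """Write a copy of src to dst with old replaced by new in the body text.

    Returns the number of replacements; dst is only written if there is one.
    """
    old_xml, new_xml = escape(old), escape(new)  # text is XML-escaped on disk
    count = 0

    def fix_run(match: re.Match) -> str:
        nonlocal count
        text = match.group(2)
        count += text.count(old_xml)
        return match.group(1) + text.replace(old_xml, new_xml) + match.group(3)

    # Read every part of the package so the copy keeps styles, images, etc.
    with zipfile.ZipFile(src) as zin:
        parts = [(info, zin.read(info.filename)) for info in zin.infolist()]

    out_parts = []
    for info, data in parts:
        if info.filename == "word/document.xml":  # main body only (assumption)
            data = W_T.sub(fix_run, data.decode("utf-8")).encode("utf-8")
        out_parts.append((info, data))

    if count:
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for info, data in out_parts:
                zout.writestr(info, data)
    return count


def process_folder(folder: Path, old: str, new: str) -> list[Path]:
    """Save a '<name>_v2.docx' copy of every .docx in folder that matched."""
    written = []
    for src in sorted(folder.glob("*.docx")):
        if src.stem.endswith("_v2"):
            continue  # skip files produced by an earlier run
        dst = src.with_name(f"{src.stem}_v2{src.suffix}")
        if replace_in_docx(src, dst, old, new):
            written.append(dst)
    return written
```

A call such as `process_folder(Path("contracts"), "Acme", "Globex")` (folder name and strings made up here) would leave the originals untouched and return the paths of the new copies. No OpenAI API call is involved; a model would only be needed if deciding what to replace requires understanding the text.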

jose123thadathil · February 21, 2025, 10:07pm

Thank you, Sir. I have one question: how does ChatGPT create a PDF or DOCX document with the requested content and make it downloadable via chat?

Foxalabs · February 21, 2025, 10:10pm

It has access to coding tools, and it runs the code it writes in a small virtual machine that gets created there and then and lasts about 45 minutes. Inside that machine it runs code similar to the above and generates your requested file with a download link.

jose123thadathil · February 21, 2025, 10:59pm

Thank you for the response. Have a great weekend!

jose123thadathil · February 21, 2025, 11:27pm

One last question. I know you have already answered, but I need to double-check and clear my mind. I know OpenAI offers APIs to chat with models, and we can consume those APIs in our own program locally. Would it be possible to send either a PDF or DOCX file, search and replace a word in the content, and then get back the result, similar to how ChatGPT creates PDF or DOCX files in chat?

jochenschultz · February 22, 2025, 1:54am

You can use Tesseract OCR, create an hOCR file to keep the formatting, then remove the bounding boxes that hold the text you want to replace, then use tools like hocrtopdf…

#!/usr/bin/env bash

# Short runnable script (Bash) that uses Tesseract + hocr tools only
# to read a PDF, replace strings, and save it back to PDF.

# Usage:
#   ./replace_pdf_text.sh input.pdf "old_string" "new_string"
# Result:
#   Creates output_replaced.pdf with all occurrences replaced.

# Requirements:
#   - pdftoppm (from poppler-utils)
#   - tesseract
#   - hocr2pdf
#   - pdfunite (from poppler-utils)

# Do NOT complain about PyMuPDF, PDFMiner, PyPDF2, or pdfplumber :-)

INPUT_PDF="$1"
OLD_STR="$2"
NEW_STR="$3"

# Convert PDF to a series of PPM images at 300 dpi
pdftoppm -r 300 "$INPUT_PDF" page

OUTPUT_PDFS=()

# pdftoppm zero-pads page numbers for longer documents (page-01.ppm, ...),
# so iterate over the files it produced instead of counting up from 1
for IMAGE in page-*.ppm; do
  [ -f "$IMAGE" ] || continue

  # Run Tesseract in HOCR mode
  BASE="${IMAGE%.ppm}"
  tesseract "$IMAGE" "$BASE" hocr

  # Replace the text in the HOCR file
  HOCR_FILE="${BASE}.hocr"
  sed -i "s/${OLD_STR}/${NEW_STR}/g" "$HOCR_FILE"

  # Convert HOCR + original image back to PDF
  OUTPUT_PDF="replaced-${BASE#page-}.pdf"
  hocr2pdf -i "$IMAGE" -r 300 -s -o "$OUTPUT_PDF" < "$HOCR_FILE"

  OUTPUT_PDFS+=("$OUTPUT_PDF")
done

# Combine individual PDFs into one
pdfunite "${OUTPUT_PDFS[@]}" "output_replaced.pdf"

echo "Done. Modified PDF is output_replaced.pdf"

this is just simplified… you need to extend the sed regex
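One way to go beyond the plain `sed` substitution is sketched below in Python. The raw `sed` call treats characters such as `/` or `.` in the search string as syntax, and it also rewrites matches inside attributes like `id` or `title`. This sketch limits the replacement to the text inside `ocrx_word` spans and treats both strings literally. The function name `replace_hocr_words` is hypothetical, and the regex assumes hOCR shaped like Tesseract's usual output (one `<span class='ocrx_word' ...>` per word). Because each word has its own span, it only handles single-word matches, and strings containing quote characters may be encoded differently in the hOCR and need extra handling.

```python
import html
import re

# One OCR word: <span class='ocrx_word' ...>text</span> (either quote style).
WORD_SPAN = re.compile(
    r"(<span[^>]*class=['\"]ocrx_word['\"][^>]*>)(.*?)(</span>)", re.DOTALL
)


def replace_hocr_words(hocr: str, old: str, new: str) -> str:
    """Replace old with new inside ocrx_word spans, leaving attributes alone."""
    # Word text is HTML-escaped in the file, so escape both strings to match.
    old_esc = html.escape(old, quote=False)
    new_esc = html.escape(new, quote=False)

    def fix_word(match: re.Match) -> str:
        text = match.group(2).replace(old_esc, new_esc)
        return match.group(1) + text + match.group(3)

    return WORD_SPAN.sub(fix_word, hocr)
```

In the script above, this could take the place of the `sed -i` line by reading each `.hocr` file, passing its contents through `replace_hocr_words`, and writing the result back before `hocr2pdf` runs.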
