Read documents (docx or PDF) from a folder, replace matching text in each matched document, and save it as a new version

Foxalabs · February 21, 2025, 10:02pm

The PDF/DocX to text bit is fairly trivial and can be done in a few lines of Python,

# importing required modules
from pypdf import PdfReader

# creating a pdf reader object
reader = PdfReader('example.pdf')

# printing number of pages in pdf file
print(len(reader.pages))

# getting a specific page from the pdf file
page = reader.pages[0]

# extracting text from page
text = page.extract_text()
print(text)

but creating the PDF back again from that text is not so trivial. It’s quite simple technically, but getting it to look right is the tricky bit. A super simple example of text to PDF:

from fpdf import FPDF

# create an FPDF object
pdf = FPDF()

# add a page
pdf.add_page()

# set the font family and size used for the text
pdf.set_font("Arial", size = 15)

# create a centred cell, then move to the start of the next line (ln = 1)
# note: recent fpdf2 releases still accept txt= and ln= but warn that they are deprecated
pdf.cell(200, 10, txt = "GeeksforGeeks",
         ln = 1, align = 'C')

# add another cell, then move directly below it (ln = 2)
pdf.cell(200, 10, txt = "A Computer Science portal for geeks.",
         ln = 2, align = 'C')

# save the pdf as GFG.pdf
pdf.output("GFG.pdf")

You could position that output text in the right place with some experimentation, and that should do it.
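Between extracting the text and writing it back out sits the actual replacement step, which is plain string work. A minimal sketch, assuming the page text is already in a Python string; `replace_terms` and the sample terms are hypothetical names made up for illustration, not part of pypdf or fpdf:

```python
import re


def replace_terms(text, replacements):
    """Replace every key of `replacements` found in `text` with its value.

    Returns (new_text, number_of_replacements).
    """
    if not replacements:
        return text, 0
    # Try longer terms first so "Acme Corp" is matched before "Acme".
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.subn(lambda m: replacements[m.group(0)], text)


# Stand-in for the string returned by page.extract_text()
page_text = "Acme Corp invoice. Contact Acme for details."
updated, hits = replace_terms(page_text, {"Acme Corp": "Foo Ltd", "Acme": "Foo"})
print(updated)  # Foo Ltd invoice. Contact Foo for details.
print(hits)     # 2
```

The count is handy for deciding whether a document matched at all, so only matched documents get a new version written.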

(DocX extraction example below. I’m sure there is a similar library to create them, but I’ve not used it before; ChatGPT would know.)

import pypandoc

# Example file:
docxFilename = 'somefile.docx'
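# with outputfile set, the text is written to somefile.txt
# and convert_file returns an empty string, so there is nothing useful to assign
pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")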
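Once each document has been converted to a text file like this, the "folder in, new version out" part of the question can be sketched with the standard library alone. This is only a rough outline under assumptions: `save_new_versions`, the `_v2` suffix and the `.txt` filter are hypothetical choices, and it works on the extracted text files, not on the original docx or PDF layout:

```python
from pathlib import Path


def save_new_versions(folder, old, new, suffix="_v2"):
    """Write a replaced copy of every .txt file in `folder` that contains `old`.

    Originals are left untouched. Returns the list of files written.
    """
    written = []
    # sorted() takes a snapshot of the folder before any new file is created
    for path in sorted(Path(folder).glob("*.txt")):
        if path.stem.endswith(suffix):
            continue  # skip versions produced by an earlier run
        text = path.read_text(encoding="utf-8")
        if old not in text:
            continue  # only matched documents get a new version
        target = path.with_name(f"{path.stem}{suffix}{path.suffix}")
        target.write_text(text.replace(old, new), encoding="utf-8")
        written.append(target)
    return written
```

Calling `save_new_versions("docs", "Acme", "Foo")` on a folder holding `somefile.txt` would leave that file alone and, if it contains "Acme", write `somefile_v2.txt` next to it.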
