I have a use case.
Would it be possible to read the documents (DOCX or PDF) from a given folder, replace matching text in each matched document, and save it as a new version? I don’t want to write any code or anything. My question is, does OpenAI have any capabilities to do that with an API call?
No, that would require at least some code to achieve with the API.
Thank you, [Foxalabs], for the update. How much code would I need to write to achieve this? Can you give me some idea? I am new to this. Also, at what point would I call the OpenAI API? Thank you, I really appreciate it.
The PDF/DOCX-to-text part is fairly trivial and can be done in a few lines of Python:
# importing required modules
from pypdf import PdfReader
# creating a pdf reader object
reader = PdfReader('example.pdf')
# printing number of pages in pdf file
print(len(reader.pages))
# getting a specific page from the pdf file
page = reader.pages[0]
# extracting text from page
text = page.extract_text()
print(text)
Creating the PDF again from that text is not so trivial, though. It is quite simple technically, but getting it to look right is the tricky bit. Here is a super simple example of text to PDF:
from fpdf import FPDF
# save FPDF() class into a
# variable pdf
pdf = FPDF()
# Add a page
pdf.add_page()
# set style and size of font
# that you want in the pdf
pdf.set_font("Arial", size = 15)
# create a cell
pdf.cell(200, 10, txt = "GeeksforGeeks",
ln = 1, align = 'C')
# add another cell
pdf.cell(200, 10, txt = "A Computer Science portal for geeks.",
ln = 2, align = 'C')
# save the pdf with name .pdf
pdf.output("GFG.pdf")
You could position that output text in the right place with some experimentation, and that should do it.
(A DOCX extraction example follows. I'm sure there is a similar library to create DOCX files, but I've not used one before; ChatGPT would know.)
import pypandoc
# Example file:
docxFilename = 'somefile.docx'
output = pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")
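For the DOCX half of the original question (replace matching text in every document in a folder and save a new version), here is a minimal sketch that uses only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose body text lives in word/document.xml. The function names replace_in_docx and replace_in_folder and the "_v2" suffix are made up for this example. It is a plain string replace on the XML, so it will miss text that Word has split across several formatting runs, and it leaves headers and footers alone.

```python
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape


def replace_in_docx(src, dst, old, new):
    """Copy src to dst, replacing old with new in the main document body.

    Returns the number of replacements made.
    """
    # Text inside the XML is escaped (& becomes &amp; and so on),
    # so escape the search and replacement strings the same way.
    old_xml, new_xml = escape(old), escape(new)
    count = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                count = xml.count(old_xml)
                data = xml.replace(old_xml, new_xml).encode("utf-8")
            # Every other archive member is copied unchanged.
            zout.writestr(item, data)
    return count


def replace_in_folder(folder, old, new, suffix="_v2"):
    """Process every .docx in folder and write '<name>_v2.docx' next to it.

    Returns a dict mapping each output file name to its replacement count.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.docx")):
        if path.stem.endswith(suffix):
            continue  # skip copies produced by an earlier run
        out = path.with_name(path.stem + suffix + path.suffix)
        results[out.name] = replace_in_docx(path, out, old, new)
    return results
```

Calling replace_in_folder("contracts", "Acme Ltd", "Acme Inc") (folder and strings are placeholders) would leave the originals untouched and write the edited copies alongside them. No OpenAI call is involved at any point; a model would only be needed if deciding what to replace required understanding the text.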
Thank you, Sir. I have one question: how does ChatGPT create a PDF or DOCX document with the requested content and make it downloadable via chat?
It has access to coding tools and runs the code it writes in a small virtual machine that gets created there and then and lasts about 45 minutes. Inside that machine it runs code similar to the examples above and generates your requested file with a download link.
Thank you for the response. Have a great weekend!
One last question. I know you have already answered, but I need to double-check and clear my mind. I know OpenAI offers APIs to chat with models, and we can consume those APIs in our program locally. Would it be possible to send either a PDF or DOCX file, search and replace a word in the content, and then get the result back, similar to how ChatGPT creates PDF or DOCX files in chat?
You can use Tesseract OCR to create an hOCR file that keeps the formatting, then remove the bounding boxes that hold the text you want to replace, then use tools like hocr2pdf…
#!/usr/bin/env bash
# Short runnable script (Bash) that uses Tesseract + hocr tools only
# to read a PDF, replace strings, and save it back to PDF.
# Usage:
# ./replace_pdf_text.sh input.pdf "old_string" "new_string"
# Result:
# Creates output_replaced.pdf with all occurrences replaced.
# Requirements:
# - pdftoppm (from poppler-utils)
# - tesseract
# - hocr2pdf
# - pdfunite (from poppler-utils)
# Do NOT complain about PyMuPDF, PDFMiner, PyPDF2, or pdfplumber :-)
INPUT_PDF="$1"
OLD_STR="$2"
NEW_STR="$3"
# Convert PDF to a series of PPM images at 300 dpi
pdftoppm -r 300 "$INPUT_PDF" page
OUTPUT_PDFS=()
# pdftoppm zero-pads page numbers for longer documents (page-01.ppm, ...),
# so loop over the files it actually produced instead of guessing names.
for IMAGE in page-*.ppm; do
    [ -f "$IMAGE" ] || continue
    # Run Tesseract in HOCR mode
    BASE="${IMAGE%.ppm}"
    tesseract "$IMAGE" "$BASE" hocr
    # Replace the text in the HOCR file
    HOCR_FILE="${BASE}.hocr"
    sed -i "s/${OLD_STR}/${NEW_STR}/g" "$HOCR_FILE"
    # Convert HOCR + original image back to PDF
    OUTPUT_PDF="replaced-${BASE#page-}.pdf"
    hocr2pdf -i "$IMAGE" -r 300 -s -o "$OUTPUT_PDF" < "$HOCR_FILE"
    OUTPUT_PDFS+=("$OUTPUT_PDF")
done
# Combine individual PDFs into one
pdfunite "${OUTPUT_PDFS[@]}" "output_replaced.pdf"
echo "Done. Modified PDF is output_replaced.pdf"
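This is just a simplified version. As written, sed treats the search string as a regular expression and also matches inside tags and attributes, so you would need to extend the sed regex.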
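As one possible way to extend that replacement step, here is a small Python sketch that could stand in for the sed line. It treats the search string literally and only touches the text between tags, so tag names and attributes such as the bbox titles are left alone. The function name replace_in_hocr is made up for this example, and the sketch assumes the hOCR file is ordinary HTML-escaped markup.

```python
import html
import re

# Matches the text between one tag's closing ">" and the next tag's "<".
_TEXT_NODE = re.compile(r">([^<>]+)<")


def replace_in_hocr(hocr: str, old: str, new: str) -> str:
    """Replace the literal string old with new in text nodes only."""

    def fix(match: re.Match) -> str:
        # Decode entities such as &amp; so the comparison is on real text.
        text = html.unescape(match.group(1))
        if old not in text:
            return match.group(0)  # untouched nodes are returned as they were
        # Re-escape the edited text so the markup stays valid.
        return ">" + html.escape(text.replace(old, new), quote=False) + "<"

    return _TEXT_NODE.sub(fix, hocr)
```

To use it on a page, read the .hocr file into a string, pass it through replace_in_hocr, and write the result back before running hocr2pdf. One limitation to keep in mind: hOCR stores each recognised word in its own span, so a search string that spans several words will not match with this approach.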