The PDF/DOCX-to-text bit is fairly trivial and can be done in a few lines of Python:
# importing required modules
from pypdf import PdfReader
# creating a pdf reader object
reader = PdfReader('example.pdf')
# printing number of pages in pdf file
print(len(reader.pages))
# getting a specific page from the pdf file
page = reader.pages[0]
# extracting text from page
text = page.extract_text()
print(text)
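If you want the text of the whole document rather than just the first page, a minimal sketch along these lines should do. The helper names `join_page_texts` and `pdf_to_text` are made up for illustration, and I'm assuming pages with no text layer (scanned images, for instance) come back empty and can be skipped.

```python
def join_page_texts(pages, separator="\n\n"):
    """Concatenate the text of each page, skipping pages that yield nothing."""
    texts = []
    for page in pages:
        # treat a missing result the same as an empty page
        text = page.extract_text() or ""
        if text.strip():
            texts.append(text.strip())
    return separator.join(texts)


def pdf_to_text(path):
    """Read a PDF with pypdf and return all of its text as one string."""
    # imported here so join_page_texts can be used without pypdf installed
    from pypdf import PdfReader
    reader = PdfReader(path)
    return join_page_texts(reader.pages)
```

Usage would be something like `text = pdf_to_text('example.pdf')`.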
Creating the PDF again from that text is not so trivial, though. It’s quite simple technically, but getting it to look right is the tricky bit. A super simple example of text to PDF:
from fpdf import FPDF
# create an FPDF object
pdf = FPDF()
# Add a page
pdf.add_page()
# set style and size of font
# that you want in the pdf
pdf.set_font("Arial", size = 15)
# create a cell
pdf.cell(200, 10, txt="GeeksforGeeks", ln=1, align='C')
# add another cell
pdf.cell(200, 10, txt="A Computer Science portal for geeks.", ln=2, align='C')
# save the PDF as GFG.pdf
pdf.output("GFG.pdf")
You could position that output text in the right place with some experimentation, and that should do it.
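As a starting point for that experimentation, here is a rough sketch that places a block of text at a chosen offset and lets the library wrap it. The function names, the default offsets, and the 12 pt Helvetica font are my own assumptions, not anything you have to use. The built-in fonts only cover Latin-1, so the sketch swaps anything else (curly quotes, for example) for a `?` rather than failing.

```python
def to_latin1(text):
    """Replace characters the built-in PDF fonts cannot encode with '?'."""
    return text.encode("latin-1", "replace").decode("latin-1")


def write_text_pdf(text, out_path, left=20, top=20, line_height=8):
    """Write text as a single wrapped column starting at (left, top), in mm."""
    # imported here so to_latin1 can be used without fpdf installed
    from fpdf import FPDF
    pdf = FPDF()
    pdf.set_left_margin(left)      # wrapped lines return to this x position
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
    pdf.set_xy(left, top)          # move the cursor to where the text starts
    # width 0 means "extend to the right margin"; long text is wrapped
    pdf.multi_cell(0, line_height, to_latin1(text))
    pdf.output(out_path)
```

Calling `write_text_pdf(text, "out.pdf")` with the extracted text gives you something to tweak; changing `left`, `top` and `line_height` is where the trial and error comes in.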
(DOCX extraction example below. I’m sure there is a similar library to create them, but I’ve not used it before; ChatGPT would know.)
import pypandoc
# Example file:
docxFilename = 'somefile.docx'
# with outputfile set, the text is written to somefile.txt and the return value is an empty string
pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")
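For going the other way (text back into a DOCX), one option is the python-docx package. I'm assuming it is installed (`pip install python-docx`); the helper names below are hypothetical, and this only produces plain paragraphs with none of the original formatting.

```python
import re


def split_paragraphs(text):
    """Split plain text into paragraphs wherever there is a blank line."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def text_to_docx(text, out_path):
    """Write each paragraph of text into a new .docx file."""
    # imported here so split_paragraphs can be used without python-docx installed
    from docx import Document
    doc = Document()
    for paragraph in split_paragraphs(text):
        doc.add_paragraph(paragraph)
    doc.save(out_path)
```

So something like `text_to_docx(open("somefile.txt").read(), "rebuilt.docx")` would turn the extracted text back into a document.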