Converting PDF to Markdown with OCR

jochenschultz · May 19, 2024, 1:14am

I mean GPT4o is multi modal. It can take images as well as pdf files I guess.

So you base 64 encode the file and send it.

Of couse you should also change the prompt to something you are doing in ChatGPT normally.

But if you want to save on API cost I would suggest to use something like ghostscript to split the PDF in single tiff files and pytesseract to convert the PDF to hocr (in a loop over each tiff).

And then use GPT-3.5 with a prompt like

give me markdown from this hocr:

[hocr]

Topic		Replies	Views
Accurately read PDF files? API	11	81102	August 29, 2023
OCR using API for text extraction API api	9	29170	December 18, 2024
What is the best way to parse a PDF file with ChatGPT? API	9	52286	November 16, 2024
How to Programmatically Extract Text from Images Using GPT-4 API gpt-4 , chatgpt , api , assistants-api	9	9949	October 14, 2024
ChatPDF.com - Chat with any PDF using the new ChatGPT API Community application , pdf , community	174	1020648	February 7, 2024

Converting PDF to Markdown with OCR

Related topics