Converting PDF to Markdown with OCR

ChatGPT can convert PDF documents to Markdown, and does it extremely well. How can we access this functionality from the API?


I mean, GPT-4o is multimodal. It can take images as well as PDF files, I believe.

So you Base64-encode the file and send it.

Of course, you should also adapt the prompt to match what you normally do in ChatGPT.

But if you want to save on API cost, I would suggest using something like Ghostscript to split the PDF into single TIFF files and pytesseract to convert each TIFF to hOCR (in a loop over the TIFFs).

And then use GPT-3.5 with a prompt like:

give me markdown from this hocr:

[hocr]

Or maybe before you do that you can try without GPT-4?

Like this:


import pytesseract
from pdf2image import convert_from_path
from bs4 import BeautifulSoup

# Convert each PDF page to a PIL image (pdf2image requires poppler)
def pdf_to_images(pdf_path):
    return convert_from_path(pdf_path, fmt='tiff')

# Run Tesseract on an image and return its hOCR output (HTML with word boxes)
def image_to_hocr(image):
    return pytesseract.image_to_pdf_or_hocr(image, extension='hocr')

# Flatten hOCR into plain text, one output line per recognized ocr_line
def hocr_to_markdown(hocr):
    soup = BeautifulSoup(hocr, 'html.parser')
    markdown_text = ""

    for line in soup.find_all('span', class_='ocr_line'):
        line_text = " ".join(word.get_text() for word in line.find_all('span', class_='ocrx_word'))
        markdown_text += f"{line_text}\n"

    return markdown_text

# Main function to convert PDF to Markdown
def pdf_to_markdown(pdf_path):
    images = pdf_to_images(pdf_path)
    markdown_text = ""

    for image in images:
        hocr = image_to_hocr(image)
        markdown_text += hocr_to_markdown(hocr) + "\n\n"

    return markdown_text

# Example usage
if __name__ == "__main__":
    pdf_path = "sample.pdf"  # Path to your PDF file
    markdown_output = pdf_to_markdown(pdf_path)
    with open("output.md", "w", encoding="utf-8") as file:
        file.write(markdown_output)

And then use GPT-4 only for the semantic extraction?

Or maybe you can try something like this on your Linux CLI (note that pdfimages writes the TIFFs to files rather than printing their names, so extract first and then loop over the output files; the sed targets the ocrx_word spans, which is where Tesseract's hOCR puts the actual text):

pdfimages -tiff sample.pdf output
for img in output-*.tif; do
    tesseract "$img" stdout hocr | sed -n 's/.*ocrx_word[^>]*>\([^<]*\)<.*/\1/p' >> output.md
done

I’ve tried Tesseract. Its accuracy is nowhere near good enough for actual use; too much of its output is random noise that nothing will be able to extract meaning from. But ChatGPT does a great job when given the PDF itself, so I’m hoping to achieve the same thing with the API.


The noise reduction is semantic analysis.

So you can create the markdown first and then send the markdown with a prompt to reduce the noise, or ask for the specific things you want to extract, or for a segmentation of the content.
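A minimal sketch of that second pass: the function below builds the chat request that hands the noisy OCR text to the model for cleanup. The system-prompt wording and the model choice are just suggestions; the call itself (commented) assumes the official `openai` package:

```python
def build_cleanup_messages(ocr_text: str) -> list:
    """Build a chat request asking the model to denoise raw OCR output
    and return clean Markdown. The system prompt is only a suggestion."""
    return [
        {"role": "system",
         "content": "You receive noisy OCR output. Remove OCR artifacts, "
                    "fix obvious misrecognitions, and return clean Markdown. "
                    "Do not invent content that is not in the input."},
        {"role": "user", "content": ocr_text},
    ]

# e.g.:
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=build_cleanup_messages(markdown_text),
# )
```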

You can also let it create a template with regular expressions, e.g. for an invoice.
When you have invoices from the same company in the same format, chances are good that you only need 2 or 3 GPT-3.5 requests until you have a nice template.
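Such a template might look like the following. The field names and patterns here are made up for illustration (in practice they would come out of those GPT-3.5 round-trips against real invoices from that vendor):

```python
import re

# Hypothetical per-vendor template: one regex per field to extract.
INVOICE_TEMPLATE = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:.]?\s*([A-Z0-9-]+)", re.I),
    "date": re.compile(r"Date\s*[:.]?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"Total\s*[:.]?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}

def extract_fields(text: str, template: dict) -> dict:
    """Apply each field pattern to the OCR text; None when a field is missing."""
    out = {}
    for field, pattern in template.items():
        m = pattern.search(text)
        out[field] = m.group(1) if m else None
    return out
```

Once the template works, every further invoice from that vendor is a pure-regex extraction with no LLM call at all.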

Or transform the hOCR to GeoJSON, import that into PostGIS, and use some GIS functions on it (which you can even create on the fly with GPT-3.5) to search for area intersections, instead of relying purely on semantic evaluation by an LLM that can hallucinate :wink:
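The hOCR-to-GeoJSON step could be sketched as follows: each hOCR word box becomes a GeoJSON polygon feature that could then be imported into PostGIS (e.g. via ogr2ogr or ST_GeomFromGeoJSON). The regex assumes Tesseract's usual attribute order (class before title) and is only a rough illustration:

```python
import re

# Matches Tesseract hOCR word spans like:
# <span class='ocrx_word' id='w1' title='bbox 10 20 30 40; x_wconf 96'>Hi</span>
WORD_RE = re.compile(
    r"<span[^>]*class=.ocrx_word[^>]*title=.bbox (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]*)</span>"
)

def hocr_to_geojson(hocr: str) -> dict:
    """Turn hOCR word bounding boxes into a GeoJSON FeatureCollection."""
    features = []
    for x0, y0, x1, y1, word in WORD_RE.findall(hocr):
        x0, y0, x1, y1 = map(int, (x0, y0, x1, y1))
        features.append({
            "type": "Feature",
            "properties": {"text": word},
            "geometry": {
                "type": "Polygon",
                # Closed ring around the word's bounding box.
                "coordinates": [[[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]],
            },
        })
    return {"type": "FeatureCollection", "features": features}
```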

And when you are OK with a bigger Docker container, you may also check out

Depending on the type of PDF you are going to run OCR on, you may also want to prefer JSON; that way you can use a JSON Schema to validate the result.
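A stdlib-only sketch of that validation step, with a made-up field spec for an invoice result; in practice you would use the `jsonschema` package with a proper JSON Schema document:

```python
# Hypothetical spec: required fields and their expected Python types.
REQUIRED = {"invoice_number": str, "date": str, "total": str}

def validate_result(result: dict, spec: dict = REQUIRED) -> list:
    """Return a list of problems; an empty list means the result looks valid."""
    problems = []
    for key, typ in spec.items():
        if key not in result:
            problems.append(f"missing field: {key}")
        elif not isinstance(result[key], typ):
            problems.append(f"wrong type for {key}: expected {typ.__name__}")
    return problems
```

Rejecting malformed extractions this way gives you a cheap retry signal before anything lands in your database.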


Personally, I am a huge fan of the database solution here. You can even create stored functions like Damerau-Levenshtein, and regular expressions with a custom database extension that can interact with an LLM :slight_smile: and many documents follow a strict norm, e.g. letters, where it is clear where certain information is.
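For reference, here is the distance mentioned above as a plain Python sketch (the restricted "optimal string alignment" variant of Damerau-Levenshtein); a stored function in the database would implement the same recurrence. It is handy for fuzzily matching OCR output against expected labels:

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein edit distance where an
    adjacent transposition ('ca' -> 'ac') also counts as a single edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```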

I wrote HUGE SQL queries for AWS Textract to compare the results of analyse_expense and analyse_text a couple of years ago.

I mean, when you transform the scanned document into a homogeneous coordinate system and have a couple of polygons and bounding boxes from the OCR result, you can compare them with ST_Intersects.
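The idea can be illustrated without a database: for axis-aligned boxes in one shared, normalized coordinate system, the intersection test is a few comparisons (a tiny stand-in for what PostGIS ST_Intersects does on arbitrary geometries):

```python
def boxes_intersect(a, b) -> bool:
    """True when two axis-aligned boxes (x0, y0, x1, y1) overlap or touch.
    Assumes both boxes live in the same normalized coordinate system."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1
```

With this you can ask questions like "which OCR words fall inside the address region of the letter template?" purely geometrically.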

Also interesting: Voronoi diagrams or the Hausdorff distance.


Of course, the simple answer is to use the OCR in Adobe Acrobat, which is made to textualize documents and improve what’s already there.

You can also get AI in it to chat about the PDF.

https://www.adobe.com/acrobat/generative-ai-pdf.html


I guess you mean the Adobe PDF Services API. I haven’t tried that yet, but I will.

Have you tried their service (the Adobe PDF Services API) yet? I’m actually trying to use Extract PDF from the following, but I haven’t found anything that returns the results in Markdown format. Kindly let me know if you find anything.

https://github.com/adobe/pdfservices-python-sdk-samples/tree/main

@jochenschultz Kindly respond as soon as you see my text.

It makes a “tiny little” difference whether one recognizes words or strings via OCR or determines the semantic meaning of these extractions. Some OCR systems already struggle with various character sets; for example, they might not recognize characters like ü, ä, or ö, or they can have issues when characters are very close to the edge of the document or if the document has been photographed and has stains or creases.

Adobe solves such problems… converting to YAML is approached differently and would be beyond the scope of this discussion; I would estimate around 4-5 years of learning time if you’re starting from scratch.

But only if you are super smart; I wouldn’t say everyone can solve this.

Your response doesn’t actually address the question; I asked about converting PDFs to Markdown using Adobe’s services. Could you please provide some insights on that specific issue?

I’ve read a couple hundred posts tonight, and I found the answer among them.
You can too.

Thank you, this is very helpful to me

Give pdftomarkdown.ai a try if you’re looking for an easy to use tool.