Converting PDF to Markdown with OCR

ChatGPT can convert PDF documents to Markdown, and does it extremely well. How can we access this functionality from the API?


I mean, GPT-4o is multimodal. It can take images as well as PDF files, I believe.

So you Base64-encode the file and send it.

Of course, you should also adapt the prompt to match what you normally do in ChatGPT.

But if you want to save on API cost, I would suggest using something like Ghostscript to split the PDF into single TIFF files and pytesseract to convert each TIFF to hOCR (in a loop over the TIFFs).

And then use GPT-3.5 with a prompt like:

give me markdown from this hocr:

[hocr]

Or maybe before you do that you can try without GPT-4?

Like this:


import pytesseract
from pdf2image import convert_from_path
from bs4 import BeautifulSoup

# Convert each PDF page to a PIL image (pdf2image requires poppler)
def pdf_to_images(pdf_path):
    return convert_from_path(pdf_path, fmt='tiff')

# Run Tesseract on an image and return its hOCR output (HTML with word boxes)
def image_to_hocr(image):
    return pytesseract.image_to_pdf_or_hocr(image, extension='hocr')

# Flatten hOCR into plain text, one output line per recognized ocr_line
def hocr_to_markdown(hocr):
    soup = BeautifulSoup(hocr, 'html.parser')
    markdown_text = ""

    for line in soup.find_all('span', class_='ocr_line'):
        line_text = " ".join(word.get_text() for word in line.find_all('span', class_='ocrx_word'))
        markdown_text += f"{line_text}\n"

    return markdown_text

# Main function to convert PDF to Markdown
def pdf_to_markdown(pdf_path):
    images = pdf_to_images(pdf_path)
    markdown_text = ""

    for image in images:
        hocr = image_to_hocr(image)
        markdown_text += hocr_to_markdown(hocr) + "\n\n"

    return markdown_text

# Example usage
if __name__ == "__main__":
    pdf_path = "sample.pdf"  # Path to your PDF file
    markdown_output = pdf_to_markdown(pdf_path)
    with open("output.md", "w", encoding="utf-8") as file:
        file.write(markdown_output)

And then use GPT-4 only for the semantic extraction?

Or maybe you can try something like this on your Linux CLI (note that pdfimages writes the TIFFs to files rather than printing their names, so extract first and then loop over the output files; the sed targets the ocrx_word spans, which is where Tesseract's hOCR puts the actual text):

pdfimages -tiff sample.pdf output
for img in output-*.tif; do
    tesseract "$img" stdout hocr | sed -n 's/.*ocrx_word[^>]*>\([^<]*\)<.*/\1/p' >> output.md
done

I’ve tried Tesseract. Its accuracy is nowhere near good enough for actual use; too much of its output is random noise that nothing will be able to extract meaning from. But ChatGPT does a great job when given the PDF itself, so I’m hoping to achieve the same thing with the API.


The noise reduction is semantic analysis.

So you can create the markdown first and then send the markdown with a prompt to reduce the noise, or ask for the specific things you want to extract, or for a segmentation of the content.
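A minimal sketch of that second pass: the function below builds the chat request that hands the noisy OCR text to the model for cleanup. The system-prompt wording and the model choice are just suggestions; the call itself (commented) assumes the official `openai` package:

```python
def build_cleanup_messages(ocr_text: str) -> list:
    """Build a chat request asking the model to denoise raw OCR output
    and return clean Markdown. The system prompt is only a suggestion."""
    return [
        {"role": "system",
         "content": "You receive noisy OCR output. Remove OCR artifacts, "
                    "fix obvious misrecognitions, and return clean Markdown. "
                    "Do not invent content that is not in the input."},
        {"role": "user", "content": ocr_text},
    ]

# e.g.:
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=build_cleanup_messages(markdown_text),
# )
```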

You can also let it create a template with regular expressions, e.g. for an invoice.
When you have invoices from the same company in the same format, chances are good that you only need 2 or 3 GPT-3.5 requests until you have a nice template.
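Such a template might look like the following. The field names and patterns here are made up for illustration (in practice they would come out of those GPT-3.5 round-trips against real invoices from that vendor):

```python
import re

# Hypothetical per-vendor template: one regex per field to extract.
INVOICE_TEMPLATE = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:.]?\s*([A-Z0-9-]+)", re.I),
    "date": re.compile(r"Date\s*[:.]?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"Total\s*[:.]?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}

def extract_fields(text: str, template: dict) -> dict:
    """Apply each field pattern to the OCR text; None when a field is missing."""
    out = {}
    for field, pattern in template.items():
        m = pattern.search(text)
        out[field] = m.group(1) if m else None
    return out
```

Once the template works, every further invoice from that vendor is a pure-regex extraction with no LLM call at all.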

Or transform the hOCR to GeoJSON, import that into PostGIS, and use some GIS functions on it (which you can even create on the fly with GPT-3.5) to search for area intersections, instead of relying purely on semantic evaluation by an LLM that can hallucinate :wink:
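The hOCR-to-GeoJSON step could be sketched as follows: each hOCR word box becomes a GeoJSON polygon feature that could then be imported into PostGIS (e.g. via ogr2ogr or ST_GeomFromGeoJSON). The regex assumes Tesseract's usual attribute order (class before title) and is only a rough illustration:

```python
import re

# Matches Tesseract hOCR word spans like:
# <span class='ocrx_word' id='w1' title='bbox 10 20 30 40; x_wconf 96'>Hi</span>
WORD_RE = re.compile(
    r"<span[^>]*class=.ocrx_word[^>]*title=.bbox (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]*)</span>"
)

def hocr_to_geojson(hocr: str) -> dict:
    """Turn hOCR word bounding boxes into a GeoJSON FeatureCollection."""
    features = []
    for x0, y0, x1, y1, word in WORD_RE.findall(hocr):
        x0, y0, x1, y1 = map(int, (x0, y0, x1, y1))
        features.append({
            "type": "Feature",
            "properties": {"text": word},
            "geometry": {
                "type": "Polygon",
                # Closed ring around the word's bounding box.
                "coordinates": [[[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]],
            },
        })
    return {"type": "FeatureCollection", "features": features}
```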

And when you are OK with a bigger Docker container, you may also check out

Depending on the type of PDF you are going to run OCR on, you may also want to prefer JSON; that way you can use a JSON Schema to validate the result.
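A stdlib-only sketch of that validation step, with a made-up field spec for an invoice result; in practice you would use the `jsonschema` package with a proper JSON Schema document:

```python
# Hypothetical spec: required fields and their expected Python types.
REQUIRED = {"invoice_number": str, "date": str, "total": str}

def validate_result(result: dict, spec: dict = REQUIRED) -> list:
    """Return a list of problems; an empty list means the result looks valid."""
    problems = []
    for key, typ in spec.items():
        if key not in result:
            problems.append(f"missing field: {key}")
        elif not isinstance(result[key], typ):
            problems.append(f"wrong type for {key}: expected {typ.__name__}")
    return problems
```

Rejecting malformed extractions this way gives you a cheap retry signal before anything lands in your database.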


Personally, I am a huge fan of the database solution here. You can even create stored functions like Damerau-Levenshtein, and regular expressions with a custom database extension that can interact with an LLM :slight_smile: and many documents follow a strict norm, e.g. letters, where it is clear where certain information is.
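For reference, here is the distance mentioned above as a plain Python sketch (the restricted "optimal string alignment" variant of Damerau-Levenshtein); a stored function in the database would implement the same recurrence. It is handy for fuzzily matching OCR output against expected labels:

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein edit distance where an
    adjacent transposition ('ca' -> 'ac') also counts as a single edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```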

I wrote HUGE SQL queries for AWS Textract to compare the results of analyse_expense and analyse_text a couple of years ago.

I mean, when you transform the scanned document into a homogeneous coordinate system and have a couple of polygons and bounding boxes from the OCR result, you can compare them with ST_Intersects.
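The idea can be illustrated without a database: for axis-aligned boxes in one shared, normalized coordinate system, the intersection test is a few comparisons (a tiny stand-in for what PostGIS ST_Intersects does on arbitrary geometries):

```python
def boxes_intersect(a, b) -> bool:
    """True when two axis-aligned boxes (x0, y0, x1, y1) overlap or touch.
    Assumes both boxes live in the same normalized coordinate system."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1
```

With this you can ask questions like "which OCR words fall inside the address region of the letter template?" purely geometrically.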

Also interesting: Voronoi diagrams or the Hausdorff distance.


Of course, the simple answer is to use the OCR in Adobe Acrobat, which is made to textualize documents and improve what’s already there.

You can also get AI in it to chat about the PDF.

https://www.adobe.com/acrobat/generative-ai-pdf.html


I guess you mean the Adobe PDF Services API. I haven’t tried that yet, but I will.

Have you tried their service (the Adobe PDF Services API) yet? I’m actually trying to use Extract PDF from the following, but I haven’t found anything that returns the results in Markdown format. Kindly let me know if you find anything.

https://github.com/adobe/pdfservices-python-sdk-samples/tree/main

@jochenschultz Kindly respond as soon as you see my text.

It makes a “tiny little” difference whether one recognizes words or strings via OCR or determines the semantic meaning of these extractions. Some OCR systems already struggle with various character sets; for example, they might not recognize characters like ü, ä, or ö, or they can have issues when characters are very close to the edge of the document or if the document has been photographed and has stains or creases.

Adobe solves such problems… converting to YAML is approached differently and would be beyond the scope of this discussion; I would estimate around 4-5 years of learning time if you’re starting from scratch.

But only if you are super smart; I wouldn’t say everyone can solve this.

Your response doesn’t actually address the question; I asked about converting PDFs to Markdown using Adobe’s services. Could you please provide some insights on that specific issue?

I’ve read a couple hundred posts tonight, and I found the answer among them.
You can too.

Thank you, this is very helpful to me

Give pdftomarkdown.ai a try if you’re looking for an easy to use tool.