Best way to read scanned PDFs? ChatGPT Vision doesn't do it? How come?

What's the best way to read scanned PDFs? I am creating an HR tool that reads CVs and parses information out of them. In simple terms, think of it as "Chat for CV".

We are testing it out with scanned PDFs and it does not work.

I thought ChatGPT Vision would solve this, since it takes images.

Does this mean I now need an external OCR provider like IBM Watson or ABBYY, and then use the ChatGPT API on the result?

Please help out

Python has a few open-source OCR libraries, such as Tesseract, that can be used. Just use one of those to extract the text from the PDFs and then let the GPT API go through it. Or, if you are not able to code it yourself, there are also plenty of "scanned PDF to text PDF using OCR" converters online.
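
For example, here is a rough sketch of that pipeline, assuming pytesseract and pdf2image are installed (plus the tesseract and poppler system binaries); the file name is just a placeholder:

```python
# A rough sketch, not production code: assumes pytesseract and pdf2image
# are installed, along with the tesseract and poppler system binaries.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str) -> str:
    """OCR every page of a scanned PDF and return the combined text."""
    pages = convert_from_path(path, dpi=300)  # render each page to a PIL image
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)

cv_text = ocr_pdf("cv.pdf")  # "cv.pdf" is a placeholder file name
# cv_text can now be sent to the GPT API for parsing
```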

Oh, got it. I can code it myself, but I thought ChatGPT Vision was going to take care of it.

So I will use Python to get the text out and then use ChatGPT…

I think the problem might be that GPT Vision scales down the image, sometimes quite dramatically depending on the original resolution.
That could blur the text in your scan.
If you cut the scan into appropriately sized chunks, it could work, but even then I don't think it is really meant for OCR and could get overwhelmed by too much text.
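
A quick sketch of that chunking idea, assuming Pillow is installed; the input file name is just an illustration:

```python
# A quick sketch of chunking a tall scan into strips, assuming Pillow;
# the file name is a placeholder.
from PIL import Image

def split_into_strips(path: str, strip_height: int = 1024):
    """Cut a tall scan into horizontal strips so each one stays
    readable after the vision model downscales it."""
    img = Image.open(path)
    for top in range(0, img.height, strip_height):
        yield img.crop((0, top, img.width, min(top + strip_height, img.height)))

for i, strip in enumerate(split_into_strips("scan_page1.png")):
    strip.save(f"chunk_{i}.png")
```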

I think an approach with a dedicated OCR model would probably be best, although you could combine the results with GPT Vision to correctly detect layout and arrangement (e.g. in tables or diagrams). If it doesn't need to read out individual characters, the lowered resolution probably won't matter as much.

What if I'm building a tool for personal use that's supposed to be able to "chat with any PDF" accurately? In my case the PDF does not only contain text, but also some scanned images, graphs, and tables (a ton of tables with very important info). I know how to use LangChain and a vector DB to take the text from the PDF, turn it into embeddings, and retrieve it when queried. But I am worried my approach won't be accurate enough for all the information that is also in the form of tables and/or graphs/images. What approach/specific libraries would you suggest?

I'm a beginner and I'm building this as a learning project. I would really appreciate your help with this 🙂
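
For reference, the text-only baseline I have in mind looks roughly like this (package names assume recent LangChain releases; the file name and query are just placeholders):

```python
# A rough sketch of the text-only embeddings baseline, assuming recent
# LangChain packages (langchain-community, langchain-openai, faiss-cpu)
# and an OPENAI_API_KEY in the environment.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

docs = PyPDFLoader("report.pdf").load()  # placeholder file name
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

db = FAISS.from_documents(chunks, OpenAIEmbeddings())
hits = db.similarity_search("What does the revenue table say?", k=4)
for hit in hits:
    print(hit.page_content[:200])
```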

Hi,
I’m facing the same issue.
Did you find any workarounds?

You can try Form Recognizer (Azure's document OCR service) as an OCR tool; it works well for both scanned and text PDFs.
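
A minimal sketch, assuming the azure-ai-formrecognizer package and a provisioned resource (the endpoint, key, and file name are placeholders):

```python
# A minimal sketch, assuming the azure-ai-formrecognizer package and a
# provisioned Form Recognizer resource; endpoint/key/file are placeholders.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("scanned_cv.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-read", document=f)

result = poller.result()
print(result.content)  # plain text extracted from the scanned pages
```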

Coming back to this question.
With the Files API now available, just upload the CV and then interact with it using messages (note the file_ids param).

The retrieval tool does a good job extracting text from the docs. Images and graphs aren’t retrieved.
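
A rough sketch of that flow, assuming the openai Python SDK and the Assistants API version that accepts file_ids on messages (newer API versions use attachments instead); the file name and prompt are placeholders:

```python
# A rough sketch, assuming the openai SDK and the Assistants API version
# that accepts file_ids on messages (newer versions use `attachments`).
from openai import OpenAI

client = OpenAI()

uploaded = client.files.create(file=open("cv.pdf", "rb"), purpose="assistants")
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Summarize this candidate's experience.",
    file_ids=[uploaded.id],  # attach the uploaded CV for retrieval
)
```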

Hi @the.brainiac,

Thank you for your input. However, what if I need to extract from images in this PDF, for example scans? Did you find any workaround for this? I've been on this topic for a few weeks now and I can't find any solution.

Thanks for your help!

Poppler and Xpdf extract images for you, so you can feed them in.
Pro tip: convert the PDF to HTML; it will give you the images + text + the exact positions of the images.
GPT-4 is better at HTML than plain text.
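
For example, a small sketch that shells out to poppler's pdftohtml (assuming the binary is on PATH; the file names are placeholders):

```python
# A small sketch of the PDF-to-HTML route, assuming the poppler-utils
# pdftohtml binary is installed and on PATH.
import subprocess

def pdf_to_html(pdf_path: str, out_stem: str) -> None:
    # -c: "complex" mode, preserves positional layout
    # -s: emit a single HTML document covering all pages
    subprocess.run(["pdftohtml", "-c", "-s", pdf_path, out_stem], check=True)

pdf_to_html("cv.pdf", "cv")  # writes HTML plus any extracted images next to it
```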

Converting the PDF to HTML and then passing the HTML to GPT gave me the best results in terms of data extraction.

Scanned PDFs don't work as of now, but I did find a workaround: I ended up converting the scanned pages to WebP images, and GPT was able to parse through the documents.

Can you provide more detail on this? I have a PDF with scanned images only (in other words, a scanned PDF), and when I sent that PDF for processing, it failed. I also tried converting the PDF with an online OCR tool and then processing it, but had no luck with that either.

I think getting the text out of the scanned PDF, or something like that, should work. But can I create an otherwise identical PDF where, instead of scanned text images, there is real text inside?

GPT-4 can indeed process images. If you upload a scanned image PDF directly, it may not work correctly because the model will attempt to extract data from what it considers to be a text-based PDF. However, since the PDF consists of scanned images, the model cannot extract any meaningful data in this form.

To address this issue, you should implement a process that reads the scanned PDF and converts each page into a WebP image. For example, if your PDF contains four pages, you would convert all four pages into individual WebP images.
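
A minimal sketch of that conversion, assuming pdf2image (with the poppler binary) and a Pillow build with WebP support; the file name is a placeholder:

```python
# A minimal sketch: render each PDF page and save it as WebP.
# Assumes pdf2image (with the poppler binary) and Pillow with WebP support.
from pdf2image import convert_from_path

pages = convert_from_path("scanned_cv.pdf", dpi=200)  # placeholder file name
image_paths = []
for i, page in enumerate(pages, start=1):
    path = f"page_{i}.webp"
    page.save(path, "WEBP")  # Pillow handles the WebP encoding
    image_paths.append(path)
```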

Once you have these images, you can then feed them into GPT-4. GPT-4o is well-suited for handling scanned and handwritten text and will be able to extract the necessary data from the images. After the text is extracted, you can convert it into a more structured format such as Markdown or any other format you need.
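
Each image can then be sent to GPT-4o, for example like this (assuming the openai SDK and an OPENAI_API_KEY in the environment; the file name and prompt are placeholders):

```python
# A minimal sketch of sending one converted page to GPT-4o, assuming the
# openai Python SDK and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

with open("page_1.webp", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this CV page as Markdown."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/webp;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```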

Finally, you can save this extracted and formatted data somewhere for further use. You might also consider using a Retrieval-Augmented Generation (RAG) approach to manage and query this extracted information effectively.

I use the AI PDF Drive GPT and then instruct it to give me the text page by page.