Best way to read scanned PDFs? ChatGPT Vision doesn't do it? How come?

What's the best way to read scanned PDFs? I'm building an HR tool that reads CVs and parses information out of them. In simple terms, think of it as "Chat for CV".

We are testing it out with Scanned PDFs and it does not work.

I thought ChatGPT Vision would solve this, since it takes images.

Does this mean I now need an external OCR provider like IBM Watson or ABBYY, and then run the ChatGPT API on the output?

Please help out

Python has a few open-source OCR libraries, such as Tesseract, that can be used. Just use one of those to extract the text from the PDFs and then let the GPT API go through it. Or, if you can't code it yourself, there are also plenty of "scanned PDF to text PDF" OCR converters online.
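A minimal sketch of that pipeline, assuming the `pdf2image` and `pytesseract` packages are installed (they also need the Poppler and Tesseract binaries on the system). The function names and prompt wording are my own, not from any library:

```python
def pdf_to_text(path: str, dpi: int = 300) -> str:
    """OCR every page of a scanned PDF and join the results."""
    # Imports deferred so the rest of this snippet works without the packages.
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract                       # needs the Tesseract binary

    pages = convert_from_path(path, dpi=dpi)  # higher DPI = sharper text for OCR
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)


def build_cv_prompt(cv_text: str) -> str:
    """Wrap the OCR output in a parsing instruction for the chat API."""
    return (
        "Extract the candidate's name, email, skills and work history "
        "as JSON from this CV text:\n\n" + cv_text
    )

# text = pdf_to_text("cv.pdf")
# ...then send build_cv_prompt(text) to the chat completions API.
```

Rendering at 300 DPI before OCR matters: Tesseract's accuracy drops sharply on low-resolution scans.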

Oh, got it. I can code it myself, but I thought ChatGPT Vision was going to take care of it.

So I'll use Python to get the text out and then use ChatGPT…

I think the problem might be that GPT Vision scales down the image, sometimes quite dramatically depending on the original resolution.
That could blur the text from your scan.
If you cut the scan into appropriately sized chunks it could work, but even then I don't think it's really meant for OCR and could get overwhelmed by too much text.
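To illustrate the chunking idea, here's a pure helper that splits a page of a given pixel size into overlapping tiles you could send to Vision one at a time. The tile size and overlap are arbitrary values I picked, not anything the API specifies:

```python
def tile_boxes(width, height, tile=1024, overlap=64):
    """Return (left, top, right, bottom) crop boxes covering the whole page,
    with a small overlap so text on a tile edge isn't cut in half."""
    boxes = []
    step = tile - overlap
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# Each box can be passed to PIL's Image.crop() before sending the tile to the API.
```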

I think an approach with a dedicated OCR model would probably be best, although you could combine the results with some GPT Vision to correctly detect layout and arrangements (e.g. in tables or diagrams). If it doesn’t need to read out the individual characters, the lowered resolution probably won’t matter as much.

What if I'm building a tool for personal use that's supposed to "chat with any PDF" accurately? In my case the PDF doesn't only contain text, but also scanned images, graphs, and tables (a ton of tables with very important info). I know how to use LangChain and a vector DB to take the text from the PDF, turn it into embeddings, and retrieve it when queried. But I'm worried my approach won't be accurate enough for the information that's in tables, graphs, or images. What approach/specific libraries would you suggest?
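For the tables specifically, one common pattern is a sketch like this, assuming the `pdfplumber` package (`pdfplumber.open()` and `page.extract_tables()` are its real API): pull each table out, serialize it to a Markdown-style grid, and embed those strings alongside the running text so the retriever can match on cell contents:

```python
def table_to_markdown(rows):
    """Serialize one extracted table (a list of rows) to a Markdown grid."""
    rows = [[("" if cell is None else str(cell)) for cell in row] for row in rows]
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_table_chunks(path):
    """Yield one Markdown string per table in the PDF, ready for embedding."""
    import pdfplumber  # deferred so table_to_markdown works without it
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                if rows:
                    yield table_to_markdown(rows)
```

Keeping each table as one chunk (rather than letting a text splitter cut through it) is the main point: a half-table retrieved without its header is nearly useless to the model.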

I'm a beginner and I'm building this as a learning project. I would really appreciate your help with this 🙂


I’m facing the same issue.
Did you find any workarounds?

You can try Azure Form Recognizer as an OCR tool; it works well for both scanned and text PDFs.
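A sketch of calling it, assuming the `azure-ai-formrecognizer` package plus a valid endpoint and key (both placeholders here); `DocumentAnalysisClient` and the `prebuilt-read` model are the SDK's real names, while the helper function is my own:

```python
def pages_to_text(page_texts):
    """Join per-page text with form-feed page-break markers."""
    return "\n\f\n".join(page_texts)


def read_scanned_pdf(path, endpoint, key):
    """Run Azure's prebuilt 'read' (OCR) model over a scanned PDF."""
    # Imports deferred so this snippet loads without the Azure SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-read", document=f)
    result = poller.result()
    return pages_to_text(
        "\n".join(line.content for line in page.lines) for page in result.pages
    )
```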


Coming back to this question.
With the Files API now available, just upload the CV and then interact with it using messages (note the file_ids param).

The retrieval tool does a good job extracting text from the docs. Images and graphs aren’t retrieved.

Hi @the.brainiac,

Thank you for your input. However, what if I need to extract from images in the PDF, for example scans? Did you find any workaround for this? I've been on this topic for a few weeks now and I can't find any solution.

Thanks for your help!

Poppler and Xpdf can extract the images for you, so you can feed them in.
Pro tip: convert the PDF to HTML; it gives you the images, the text, and the exact position of each image.
GPT-4 handles HTML better than plain text.
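One way to do that conversion is Poppler's `pdftohtml` CLI; `-c` (complex layout) and `-s` (single HTML document) are real flags, and the output base name is whatever you pick. A thin wrapper:

```python
import subprocess


def pdftohtml_cmd(pdf_path, out_base):
    """Build the pdftohtml command: -c preserves layout, -s emits one file."""
    return ["pdftohtml", "-c", "-s", pdf_path, out_base]


def convert_to_html(pdf_path, out_base):
    """Run the conversion; the HTML and extracted images land next to out_base."""
    subprocess.run(pdftohtml_cmd(pdf_path, out_base), check=True)

# convert_to_html("cv.pdf", "cv")  # writes the HTML plus image files for each page
```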


Converting the PDF to HTML and then passing the HTML to GPT gave me the best results in terms of data extraction.

Scanned PDFs don't work as of now, but I did find a workaround: I converted the scanned images to WebP format, and GPT was able to parse through the documents.
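For reference, that re-encoding is a one-liner with Pillow (`Image.open(...).save(..., "WEBP")` is its real API); the filename helper is my own:

```python
from pathlib import Path


def webp_name(image_path):
    """Derive the output filename by swapping the extension for .webp."""
    return str(Path(image_path).with_suffix(".webp"))


def to_webp(image_path):
    """Re-encode any Pillow-readable image (PNG, JPEG, TIFF scan, ...) as WebP."""
    from PIL import Image  # deferred so webp_name works without Pillow
    out = webp_name(image_path)
    Image.open(image_path).save(out, "WEBP")
    return out
```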