Whats best way to read scanned PDFs. I am creating an HR tool which reads CVs and parses information out of them. In easy term think of it as “Chat for CV”
We are testing it out with Scanned PDFs and it does not work.
I thought ChatGPT Vision would solve this , since it takes images.
Does this mean, I need to now get an external OCR provider like IBM Watson or Abby and then use ChatGPT api on it ?
Python has a few open source OCR libraries such as Tesseract that can be used. Just use that to parse the text of the pdfs and then let gpt api go through it. Or if you are not able to code yourself, there are also a lot of “scanned pdf to text pdf using OCR” converters online.
I think the problem might be that the GPT Vision scales down the image, sometimes quite dramatically depending on the original resolution.
That could blurr the text from your scan.
If you cut the scan into appropriate chunks, it could work, but even then I don’t think it is really meant for OCR and could get overwhelmed by too much text.
I think an approach with a dedicated OCR model would probably be best, although you could combine the results with some GPT Vision to correctly detect layout and arrangements (e.g. in tables or diagrams). If it doesn’t need to read out the individual characters, the lowered resolution probably won’t matter as much.
What if im building a tool for personal use thats supposed to be able to “chat with any pdf” accurately? in my case the pdf does not only contain text, but some scanned images, graphs, and tables (a ton of tables with very important info) . I know how to use langchain and a vector db to take the text from the pdf, turn it into embeddings and retrieve it when queried. But, I am worried my approach wont be accurate enough for all the information thats also in the form of tables and/or graphs/images. What approach/specifc libraries would you suggest?
Im a beginner and im building this as a learning project, I would really appreciate your help with this
thank you for your input. However, what if I need to extract from images on this pdf? For example scans? Did you find any workaround for this? Been on this topic for a few weeks now and I can’t find any solution.
Poppler and x-pdf extract images for you, so you can feed them.
pro tip convert pdf to html, it will give you both the images + text + exact position of images.
GPT-4 is better at html than plain text.
Scanned pdf don’t work as of now, but I did find a workaround where I ended up converting the scanned images to WebP format and GPT was able to parse through the documents.
Can you provide more detail on this, as I’ve a pdf with scanned images only (or in other words, scanned pdf), and when i sent that pdf for processing, it is getting failed, I tried converting the pdf using an online ocr tool also and then tried processing it, but got no luck with that.
I think getting the text from the scanned pdf or something like should work, but can i create an exact same pdf but inside instead of scanned text images, there is text.
GPT-4 can indeed process images. If you upload a scanned image PDF directly, it may not work correctly because the model will attempt to extract data from what it considers to be a text-based PDF. However, since the PDF consists of scanned images, the model cannot extract any meaningful data in this form.
To address this issue, you should implement a process to read the scanned PDF and convert each page into a webp format image. For example, if your PDF contains four pages, you would convert all four pages into individual webp images.
Once you have these images, you can then feed them into GPT-4. GPT-4o is well-suited for handling scanned and handwritten text and will be able to extract the necessary data from the images. After the text is extracted, you can convert it into a more structured format such as Markdown or any other format you need.
Finally, you can save this extracted and formatted data somewhere for further use. You might also consider using a Retrieval-Augmented Generation (RAG) approach to manage and query this extracted information effectively.