I'm trying to build a RAG system using GPT models for a digital document library.
Some of these docs are scanned PDF files.
I'm trying to figure out how to approach this with OpenAI models, especially since the text extraction was spot on when I tested a few scanned docs in ChatGPT.
I've experimented with other models (Llama 3.2 Vision), but they don't seem to work very well.
Does anyone know how GPT processes these PDFs so precisely? Can this be replicated through the API?
You can make an API call to extract the text from a document, but it needs to be done at the point you create the RAG database. Once it's created, you don't need to do it again.
The vision ability can be invoked from the API by following this example:
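Something along these lines — just a minimal sketch, assuming the pdf2image library (with poppler installed) for rendering pages and a vision-capable model such as gpt-4o:

```python
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path  # assumption: pdf2image + poppler for page rendering

client = OpenAI()

def extract_text_from_scanned_pdf(pdf_path: str) -> str:
    """Render each PDF page to an image and transcribe it with a vision-capable model."""
    pages = convert_from_path(pdf_path, dpi=200)  # one PIL image per page
    extracted = []
    for i, page in enumerate(pages, start=1):
        # Encode the rendered page as a base64 PNG data URL
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

        response = client.chat.completions.create(
            model="gpt-4o",  # assumption: any vision-capable model works here
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": f"Transcribe all text on page {i} exactly as written."},
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    ],
                }
            ],
        )
        extracted.append(response.choices[0].message.content)
    return "\n\n".join(extracted)
```

You'd run this once per document while building the RAG database, then chunk and embed the returned text like any other source.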
What I don't understand is how ChatGPT itself handles a scanned PDF. Does it do the same thing?
Meaning, when given a PDF, does it convert it to images and understand them through vision?
Because:
1- It does it super fast.
2- It seems to keep context of the whole PDF, as if it reads all the pages as one item, even for PDFs that are tens of pages long.
Or does it use some super duper extra OCR capability to process the files efficiently? My experience with vision in other models is that they're good at understanding a single image but struggle when sent multiple images.
This is especially puzzling since there is also a limit on the number of images processed per question (10 images), yet I've uploaded a 70-page scanned PDF to GPT before and it was able to understand it as a whole (it was a lease contract).
So does it run vision 7 times for this file while keeping context?
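Something like this, maybe? (Just a sketch of what I'm imagining: batches of up to 10 page images per request, with the text extracted so far passed back in so the model keeps context across batches.)

```python
from openai import OpenAI

def extract_in_batches(page_images: list[str], batch_size: int = 10) -> str:
    """Hypothetical sketch: page_images are base64 data URLs of each scanned page."""
    client = OpenAI()
    transcript = ""
    for start in range(0, len(page_images), batch_size):
        batch = page_images[start:start + batch_size]
        # Pass the transcript so far as text, plus the next batch of page images
        content = [
            {
                "type": "text",
                "text": "Document transcribed so far:\n"
                        f"{transcript}\n\n"
                        "Continue the transcription with the following pages.",
            }
        ] + [{"type": "image_url", "image_url": {"url": url}} for url in batch]

        response = client.chat.completions.create(
            model="gpt-4o",  # assumption: a vision-capable model
            messages=[{"role": "user", "content": content}],
        )
        transcript += "\n\n" + response.choices[0].message.content
    return transcript.strip()
```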
I'm really confused by the gap between what the docs say and what my experience using GPT shows, so I must be missing something.