I have been reading about OCR capabilities with images, and all I have found that is even a little helpful is a post about using Spectre.Console.dll. What ChatGPT returns is a synopsis of what the image says, but it doesn’t return the exact text as it was read. Can someone point me to an example of how that is done?
The vision models aren’t really built for OCR. You might have more success with a dedicated OCR tool like Tesseract, and then work with the extracted text. The vision models can read or extract some things, but the issue is that they tend to hallucinate - infer things that could logically be there, but aren’t.
You can think of it as being shown a picture for a second and then being asked to describe in high detail what you just saw. That’s why, at the moment, it doesn’t really work the way you’re expecting.
I have found that as context windows increase, more and more Document Chat applications are turning to the same pipeline: split the PDF into pages → convert each page into an image → provide the images one by one to the AI to extract the text.
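A minimal sketch of that pipeline in Python, assuming the `pdf2image` and `openai` packages (plus poppler for `pdf2image`); the model name and prompt wording here are just illustrative placeholders:

```python
import base64
import io


def image_messages(png_bytes: bytes, prompt: str) -> list:
    """Build a chat `messages` payload with one page image attached as a data URL."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]


def transcribe_pdf(path: str, client, model: str = "gpt-4o") -> list:
    """Render each PDF page to PNG and ask the model to transcribe it verbatim."""
    from pdf2image import convert_from_path  # requires poppler installed

    transcripts = []
    for page in convert_from_path(path, dpi=200):
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        resp = client.chat.completions.create(
            model=model,
            messages=image_messages(
                buf.getvalue(),
                "Transcribe this page verbatim. Return only the text you see."),
        )
        transcripts.append(resp.choices[0].message.content)
    return transcripts
```

Even with a pipeline like this, expect to proofread the output: as noted above, the model will occasionally "fill in" text that isn’t there.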
Same as Tesseract, basically. However, I have tried it, and the Assistant still isn’t capable enough to get 100% of the text (again, same as Tesseract); it relies heavily on being supervised and on revisions being made.
In the OpenAI Cookbook I found a similar application; it was named "Parse PDF for GPT-4o" or something like that - sorry, I haven’t gotten the chance to find it again.
It’s interesting to me that I can get ChatGPT to tell me about what was written in the image, but it has no way of returning exactly what it interpreted from the image. If it can create a generalization of the document, I would think it could just return everything it "sees".
I am testing against a scanned document of telescope installation instructions. When I gave the exact instructions you listed above, it did a little better, but not much. It still will not give me a verbatim transcript.
I also used a system message telling the AI to repeat back verbatim the contents of any images, without alteration.
Something to consider: an image is 85 tokens (or less with gpt-4o). It would be the ultimate feat of compression if that could reliably contain more than 85 tokens’ worth of language to be extracted.
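Rough back-of-envelope arithmetic makes the point, assuming the common rule of thumb of about 0.75 English words per token (the exact ratio varies by tokenizer and text):

```python
IMAGE_TOKENS = 85        # cost of a low-detail image in the chat API
WORDS_PER_TOKEN = 0.75   # rough rule of thumb for English prose

# Upper bound on how much text an 85-token representation could carry.
max_words = IMAGE_TOKENS * WORDS_PER_TOKEN
print(max_words)  # 63.75 -> roughly 60 words

# A typical printed page runs a few hundred words, so a verbatim
# transcript would need several times that token budget.
page_words = 400
tokens_needed = page_words / WORDS_PER_TOKEN
print(tokens_needed)  # ~533 tokens for a 400-word page
```

So a dense scanned page simply cannot round-trip through an 85-token image representation, which is consistent with the lossy, summary-like output people are seeing.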