I need to understand how OCR works in Vision Language Models (VLMs). Specifically, how do OpenAI's Vision Language Models (VLMs) or Multimodal Large Language Models (MLLMs), such as GPT-4, perform OCR when an image containing text is uploaded? ChatGPT mentioned that if there is text in an image, the model first performs OCR using a third-party API before further processing it.
A lot of people are wondering the same thing (myself included). From what I can tell, OpenAI doesn't publish that information, even though it's obvious they use something to OCR PDFs in particular (I tried multiple ones through ChatGPT).
It would be great to know what is used so we could replicate its effectiveness through the API.
I've already tried every Tesseract-based Python library out there to OCR PDFs. Nothing comes close to what GPT-4o uses.
Whatever that third-party API is, it's incredibly powerful on its own.
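For what it's worth, even without knowing what ChatGPT uses internally, you can approximate the behavior through the public API by sending each page as an image to a vision-capable model and asking it to transcribe the text. Here's a minimal sketch; the model name (`gpt-4o`), the prompt, and the overall approach are my assumptions, not a description of OpenAI's internal pipeline:

```python
import base64

# Assumed prompt -- tune this for your documents.
OCR_PROMPT = "Transcribe all text in this image exactly, preserving layout."


def build_ocr_messages(image_bytes: bytes, mime: str = "image/png") -> list:
    """Build a Chat Completions `messages` payload with an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {
                    "type": "image_url",
                    # Images can be passed inline as base64 data URLs.
                    "image_url": {"url": f"data:{mime};base64,{b64}"},
                },
            ],
        }
    ]


def ocr_image(image_path: str) -> str:
    """Send one page image to the API and return the transcription."""
    # Requires the `openai` package and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    with open(image_path, "rb") as f:
        messages = build_ocr_messages(f.read())
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```

For PDFs you'd first rasterize each page to an image (e.g. with pdf2image/Poppler) and loop `ocr_image` over the pages. It's slower and pricier than Tesseract, but in my experience the quality gap the posts above describe mostly comes from the model itself, so this gets you much closer.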