I need to understand how OCR works in Vision Language Models (VLMs). Specifically, how do OpenAI's Vision Language Models (VLMs) or Multimodal Large Language Models (MLLMs), such as GPT-4, perform OCR when an image containing text is uploaded? ChatGPT mentioned that if there is text in an image, the model first performs OCR using a third-party API before further processing it.
A lot of people are wondering the same thing (myself included). From what I can tell, OpenAI doesn't publish that information, even though it's obvious they use something to OCR PDFs in particular (I tried multiple ones through ChatGPT).
It would be great to know what is used so we could replicate its effectiveness through the API.
I've already tried every Tesseract-based Python library out there to OCR PDFs. Nothing comes close to what GPT-4o uses.
Whatever that third-party API is, it's incredibly powerful on its own.
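For what it's worth, even without knowing what ChatGPT uses internally, you can approximate the behavior through the public API by sending each page as an image to a vision-capable model and asking it to transcribe the text. Here's a minimal sketch; the model name (`gpt-4o`), the prompt, and the overall approach are my assumptions, not a description of OpenAI's internal pipeline:

```python
import base64

# Assumed prompt -- tune this for your documents.
OCR_PROMPT = "Transcribe all text in this image exactly, preserving layout."


def build_ocr_messages(image_bytes: bytes, mime: str = "image/png") -> list:
    """Build a Chat Completions `messages` payload with an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {
                    "type": "image_url",
                    # Images can be passed inline as base64 data URLs.
                    "image_url": {"url": f"data:{mime};base64,{b64}"},
                },
            ],
        }
    ]


def ocr_image(image_path: str) -> str:
    """Send one page image to the API and return the transcription."""
    # Requires the `openai` package and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    with open(image_path, "rb") as f:
        messages = build_ocr_messages(f.read())
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```

For PDFs you'd first rasterize each page to an image (e.g. with pdf2image/Poppler) and loop `ocr_image` over the pages. It's slower and pricier than Tesseract, but in my experience the quality gap the posts above describe mostly comes from the model itself, so this gets you much closer.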