Why can't I read images via Completions API?

A language AI will simply answer your question in natural language. What you are asking for requires multiple processing paths, and the Completions API gives image data no way to flow between them.

I gave a link to Google’s specialized models, where you can see a dedicated “layout parser”.

GPT-4 vision does not have high-quality “grounding”, the ability to reference the locations of elements within an image.

One thing it certainly cannot do without an extensive agentic framework is “put this map in an identical output document”.


Even though I know right off the bat what it cannot do, let’s assume we understand the need to extract individual sections, such as individual columns and individual images, to have any hope of proceeding further. Then we go in with more educated naivety:

Here’s an application built with the top OpenAI reasoning model, which can accept larger images than other models thanks to its patch-based algorithm for vision input (although the image you provided is small and of poor quality).

SYSTEM

You are Docuo, a language model that uses your vision ability to reproduce data from documents as accurately and comprehensively as possible.
Stages of document preparation you may fulfill:

  • initial analysis of contents and quality
  • layout parsing, allowing slicing of the document programmatically into constituent elements
  • element data extraction, such as complete OCR, complete visual description for reconstruction.

Then we embark on just one round:

For the attached image, provide a structural framework about the document composition and layout for parsing/extraction:

  1. full image dimensions received,
  2. four-corner coordinate bounding boxes for contiguous texts
  3. four-corner bounding boxes for images
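For reference, here is a minimal sketch of how a round like this is sent through the Chat Completions API, with the image passed as a base64 data URL. The model name, file name, and abbreviated prompts are placeholders, not the exact values used here:

```python
# Sketch: one vision round via the Chat Completions API.
# Assumptions: model name and file name are placeholders; the prompts
# are abbreviated versions of the SYSTEM and user text quoted above.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("scanned_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: substitute your vision-capable reasoning model
    messages=[
        {"role": "system", "content": "You are Docuo, a language model that ..."},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "For the attached image, provide a structural framework ...",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        },
    ],
)
print(response.choices[0].message.content)
```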

Then, for your image, which is actually 550 px wide by 750 px high, we can see that each element produced does not fit the task. Vision cannot function as an element extractor to facilitate any kind of document reconstruction:

Below is a suggested “slicing” of the scanned page into its principal structural blocks. All coordinates are given in pixels relative to the top‐left of the full image, which we take here as width=785 px, height=535 px.

Full‐page
• Corner points:
– top‐left (0, 0)
– top‐right (785, 0)
– bottom‐right (785, 535)
– bottom‐left (0, 535)

Header band (runs across the full width; contains “LAOS” / “66” / “LAO‐TZI”)
• Approx. y = 0 → 50 px
• Corners:
– TL (0, 0)
– TR (785, 0)
– BR (785, 50)
– BL (0, 50)

Left‐column map image (“LAOS – scară …”)
• Approx. x = 10 → 380, y = 60 → 260 px
• Corners:
– TL (10, 60)
– TR (380, 60)
– BR (380, 260)
– BL (10, 260)

Left‐column body text (under the map, continues to bottom margin)
• Approx. x = 10 → 380, y = 260 → 525 px
• Corners:
– TL (10, 260)
– TR (380, 260)
– BR (380, 525)
– BL (10, 525)

Right‐column body text (starts under the header, above the portrait)
• Approx. x = 395 → 775, y = 60 → 350 px
• Corners:
– TL (395, 60)
– TR (775, 60)
– BR (775, 350)
– BL (395, 350)

Right‐column portrait image (Lao-tzi) plus its caption
• Approx. x = 390 → 785, y = 350 → 525 px
• Corners:
– TL (390, 350)
– TR (785, 350)
– BR (785, 525)
– BL (390, 525)

You can of course refine these slightly in your own layout tool, but these six boxes will let you isolate (a) the header, (b) two image assets, and (c) the two main text flows for OCR.
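Acting on boxes like these is mechanically trivial; for example, slicing with Pillow takes a few lines (a sketch using the model’s boxes verbatim and a placeholder file name). The failure is in the coordinates the model invents, not in the cropping code:

```python
# Sketch: crop the page into the six boxes the model returned.
# Pillow's crop() takes (left, upper, right, lower); the four-corner
# boxes above are axis-aligned, so two opposite corners suffice.
from PIL import Image

boxes = {
    "full_page": (0, 0, 785, 535),
    "header": (0, 0, 785, 50),
    "map_image": (10, 60, 380, 260),
    "left_text": (10, 260, 380, 525),
    "right_text": (395, 60, 775, 350),
    "portrait": (390, 350, 785, 525),
}

page = Image.open("scanned_page.png")
print(page.size)  # actually (550, 750); the model assumed (785, 535)

for name, box in boxes.items():
    # Crops run past the real right edge and stop short of the bottom,
    # so none of them isolate the element they are named for.
    page.crop(box).save(f"{name}.png")
```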

Result: its first “extraction” by coordinates presumes a 785 × 535 page for what is actually a 550 × 750 image, so every box misses its target.

I would suggest instead that you use PDF software tools that are document-aware and specifically designed to enhance PDF files with searchable text.
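One open-source route in that direction (an illustration of the suggestion, not something tested in this thread) is Tesseract via pytesseract, which lays a searchable text layer over the scan; this is OCR enhancement, not layout reconstruction:

```python
# Sketch: produce a searchable PDF from the scan with Tesseract.
# Assumes the tesseract binary is installed and on PATH, and that
# "scanned_page.png" is a placeholder for your file.
import pytesseract

pdf_bytes = pytesseract.image_to_pdf_or_hocr("scanned_page.png", extension="pdf")
with open("scanned_page_searchable.pdf", "wb") as f:
    f.write(pdf_bytes)
```

Dedicated tools like these are document-aware in a way a chat model’s vision output is not.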
