Why can't I read images via Completions API?

A language AI will simply answer your question in natural language. What you are asking for requires multiple processing paths, and the Completions API gives image data no way to flow between them.

I gave a link to Google’s specialized models, where you can see a dedicated “layout parser”.

GPT-4 vision does not have high-quality “grounding”, the ability to reference the locations of elements within an image.

One thing it certainly cannot do without an extensive agentic framework is “put this map in an identical output document”.


Even though I know right off the bat what it cannot do, let’s assume we understand the need to extract individual sections, such as individual columns and individual images, to have any hope of proceeding further. Then we go in with more educated naivety:

Here’s an application built with the top OpenAI reasoning model, which can accept larger images than other models thanks to its patch-based algorithm for vision input (although the image you provided is small and of poor quality).

SYSTEM

You are Docuo, a language model that uses your vision ability to reproduce data from documents as accurately and comprehensively as possible.
Stages of document preparation you may fulfill:

  • initial analysis of contents and quality
  • layout parsing, allowing slicing of the document programmatically into constituent elements
  • element data extraction, such as complete OCR, complete visual description for reconstruction.

Then we embark on just one round:

For the attached image, provide a structural framework about the document composition and layout for parsing/extraction:

  1. full image dimensions received,
  2. four-corner coordinate bounding boxes for contiguous texts
  3. four-corner bounding boxes for images
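For reference, here is a minimal sketch of how a round like this is sent through the Chat Completions API, with the image passed as a base64 data URL. The model name, file name, and abbreviated prompts are placeholders, not the exact values used here:

```python
# Sketch: one vision round via the Chat Completions API.
# Assumptions: model name and file name are placeholders; the prompts
# are abbreviated versions of the SYSTEM and user text quoted above.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("scanned_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: substitute your vision-capable reasoning model
    messages=[
        {"role": "system", "content": "You are Docuo, a language model that ..."},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "For the attached image, provide a structural framework ...",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        },
    ],
)
print(response.choices[0].message.content)
```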

Then, for your image, which is actually 550 px wide by 750 px high, we can see that each element produced does not fit the task. Vision cannot function as an element extractor to facilitate any kind of document reconstruction:

Below is a suggested “slicing” of the scanned page into its principal structural blocks. All coordinates are given in pixels relative to the top‐left of the full image, which we take here as width=785 px, height=535 px.

Full‐page
• Corner points:
– top‐left (0, 0)
– top‐right (785, 0)
– bottom‐right (785, 535)
– bottom‐left (0, 535)

Header band (runs across the full width; contains “LAOS” / “66” / “LAO‐TZI”)
• Approx. y = 0 → 50 px
• Corners:
– TL (0, 0)
– TR (785, 0)
– BR (785, 50)
– BL (0, 50)

Left‐column map image (“LAOS – scară …”)
• Approx. x = 10 → 380, y = 60 → 260 px
• Corners:
– TL (10, 60)
– TR (380, 60)
– BR (380, 260)
– BL (10, 260)

Left‐column body text (under the map, continues to bottom margin)
• Approx. x = 10 → 380, y = 260 → 525 px
• Corners:
– TL (10, 260)
– TR (380, 260)
– BR (380, 525)
– BL (10, 525)

Right‐column body text (starts under the header, above the portrait)
• Approx. x = 395 → 775, y = 60 → 350 px
• Corners:
– TL (395, 60)
– TR (775, 60)
– BR (775, 350)
– BL (395, 350)

Right‐column portrait image (Lao-tzi) plus its caption
• Approx. x = 390 → 785, y = 350 → 525 px
• Corners:
– TL (390, 350)
– TR (785, 350)
– BR (785, 525)
– BL (390, 525)

You can of course refine these slightly in your own layout tool, but these six boxes will let you isolate (a) the header, (b) two image assets, and (c) the two main text flows for OCR.
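Acting on boxes like these is mechanically trivial; for example, slicing with Pillow takes a few lines (a sketch using the model’s boxes verbatim and a placeholder file name). The failure is in the coordinates the model invents, not in the cropping code:

```python
# Sketch: crop the page into the six boxes the model returned.
# Pillow's crop() takes (left, upper, right, lower); the four-corner
# boxes above are axis-aligned, so two opposite corners suffice.
from PIL import Image

boxes = {
    "full_page": (0, 0, 785, 535),
    "header": (0, 0, 785, 50),
    "map_image": (10, 60, 380, 260),
    "left_text": (10, 260, 380, 525),
    "right_text": (395, 60, 775, 350),
    "portrait": (390, 350, 785, 525),
}

page = Image.open("scanned_page.png")
print(page.size)  # actually (550, 750); the model assumed (785, 535)

for name, box in boxes.items():
    # Crops run past the real right edge and stop short of the bottom,
    # so none of them isolate the element they are named for.
    page.crop(box).save(f"{name}.png")
```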

Result: its first “extraction” by coordinates presumes a 785 × 535 page for what is actually a 550 × 750 image, so every box misses its target.

I would suggest instead that you use PDF software tools that are document-aware and specifically designed to enhance PDF files with searchable text.
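One open-source route in that direction (an illustration of the suggestion, not something tested in this thread) is Tesseract via pytesseract, which lays a searchable text layer over the scan; this is OCR enhancement, not layout reconstruction:

```python
# Sketch: produce a searchable PDF from the scan with Tesseract.
# Assumes the tesseract binary is installed and on PATH, and that
# "scanned_page.png" is a placeholder for your file.
import pytesseract

pdf_bytes = pytesseract.image_to_pdf_or_hocr("scanned_page.png", extension="pdf")
with open("scanned_page_searchable.pdf", "wb") as f:
    f.write(pdf_bytes)
```

Dedicated tools like these are document-aware in a way a chat model’s vision output is not.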
