Understanding very large images

Hello experts,
I am a blind developer. I am investigating whether vision capability of ChatGPT can be used to help blind computer users to understand web pages. Basically I am thinking about writing a tool that will:

  1. Take a screenshot of an entire web page as it appears in the browser.
  2. Send that screenshot to ChatGPT via the API with a prompt such as “describe the visual layout of the page”, “describe region X on the page”, or “find a control element that does Y”. (A rough sketch of what I mean is below.)
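
For context, here is a minimal sketch of the kind of tool I have in mind, assuming Playwright for the full-page screenshot and the official OpenAI Python SDK; the model name and prompt are just placeholders:

```python
import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

def capture_full_page(url: str, path: str = "page.png") -> str:
    # Capture the entire page, not just the visible viewport.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=path, full_page=True)
        browser.close()
    return path

def describe_screenshot(path: str, prompt: str) -> str:
    # Send the screenshot as a base64 data URL along with the prompt.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}",
                               "detail": "high"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(describe_screenshot(capture_full_page("https://example.com"),
                          "Describe the visual layout of the page."))
```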

But the problem I am running into is that ChatGPT seems to resize images before passing them to the model, and then it says something to the effect of “the text is too small, please zoom in on some part of your web page”.
Just to give you a concrete example, a sample web page I was playing with is on the larger side and its screenshot is about 1200x20000 pixels. Twenty thousand is not a typo; the height really is that gigantic.
So far I have read in the ChatGPT docs that large images are resized down to fit within 768x2000 in high-res mode. No wonder, then, that the model can’t read any text.
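If I understand the resize correctly (shrinking the image proportionally until it fits inside a 768x2000 box, which is just my reading of the docs), the back-of-envelope math for my screenshot looks grim:

```python
# Assumption: the image is scaled proportionally to fit inside 768x2000.
width, height = 1200, 20000
scale = min(768 / width, 2000 / height)   # min(0.64, 0.1) = 0.1
print(width * scale, height * scale)      # 120.0 2000.0
# A 16 px line of text would end up roughly 1.6 px tall - unreadable.
```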
The model suggested zooming in on a region, but that doesn’t work for blind people: there is no way for them to pinpoint the region they’re interested in, and I can’t think of a way to automatically detect which region to zoom in on.
The docs also suggest splitting large images into smaller ones, but I am worried that this might cause side effects, such as:

  1. If I happen to cut the screenshot so that the cut line goes through some text, that text might be lost, as the model won’t be able to recognize half-cut text in either tile.
  2. General confusion of the model: if a large visual element such as a table ends up split across multiple tiles, the model might not view it as a single table but rather as a sequence of unrelated tables.

While these side effects are not critical for my use case (arguably), I still wish I could find a more elegant solution. So I’m wondering: am I missing anything? Can my problem be solved in a better way than just splitting into smaller images? Is there at least a smart way to split into smaller images that wouldn’t cut any text in half? A rough sketch of the kind of splitting I have in mind is below.
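
To make that last question concrete, this is the sort of “smart” splitting I mean: scan for rows of pixels that are visually uniform (likely background between lines of text) and cut there instead of at fixed offsets. It is only a sketch, assuming Pillow and NumPy, and the tile height, search window, and uniformity threshold are guesses:

```python
import numpy as np
from PIL import Image

def split_at_blank_rows(path: str, target_height: int = 1800,
                        search_window: int = 200, std_threshold: float = 2.0):
    """Cut a tall screenshot into tiles, preferring cut lines that fall on
    visually uniform rows (likely background) near the target tile height."""
    img = Image.open(path)
    gray = np.asarray(img.convert("L"), dtype=np.float32)
    # Per-row standard deviation: near zero means the row is a solid color.
    row_std = gray.std(axis=1)

    tiles, top = [], 0
    while top < gray.shape[0]:
        bottom = min(top + target_height, gray.shape[0])
        if bottom < gray.shape[0]:
            # Look a little above the default cut line for a blank row.
            lo = max(top + 1, bottom - search_window)
            window = row_std[lo:bottom]
            blanks = np.where(window < std_threshold)[0]
            if blanks.size:
                bottom = lo + int(blanks[-1])  # lowest blank row in the window
        tiles.append(img.crop((0, top, img.width, bottom)))
        top = bottom
    return tiles

for i, tile in enumerate(split_at_blank_rows("page.png")):
    tile.save(f"tile_{i:03d}.png")
```

This still does nothing about large elements like tables being spread across tiles, which is why I’m asking whether there is a better approach altogether.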
Also, if OpenAI developers are reading this, may I suggest a feature request for a future version: it would be great if the model had the ability to automatically zoom in on a specific part of a giant image. I feel that would help a great deal in my use case.
Thanks!