Understanding very large images

Hello experts,
I am a blind developer. I am investigating whether the vision capability of ChatGPT can be used to help blind computer users understand web pages. Basically, I am thinking about writing a tool that will:

  1. Take a screenshot of an entire web page as it appears in the browser.
  2. Send that screenshot to ChatGPT via the API with a prompt such as “describe the visual layout of the page”, “describe region X on the page”, or “find a control element that does Y”.
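
Step 2 can be sketched as follows. This is a minimal sketch against the Chat Completions API that only builds the request payload; the model name and the placeholder PNG bytes are illustrative, and the screenshot is embedded as a base64 data URL with `detail` set to `high`:

```python
import base64
import json

def build_vision_request(png_bytes: bytes, prompt: str, model: str = "gpt-4o") -> dict:
    """Build a Chat Completions payload that sends a screenshot alongside a prompt.

    The image is embedded as a base64 data URL; detail="high" requests the
    high-resolution processing path.
    """
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}",
                            "detail": "high",
                        },
                    },
                ],
            }
        ],
    }

# Placeholder bytes stand in for a real screenshot here:
payload = build_vision_request(b"\x89PNG...", "Describe the visual layout of the page")
print(json.dumps(payload)[:60])
```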

But the problem I am running into is that the API resizes images before passing them to the model. ChatGPT then says something to the effect of “the text is too small, please zoom in on some part of your web page”.
Just to give a concrete example: a sample web page I was playing with is on the larger side, and its screenshot is about 1200x20000 pixels. Twenty thousand is not a typo; the height really is gigantic.
So far I have read in the docs that large images are resized down to at most 768x2048 in high-detail mode. No wonder the model can’t read any text.
The model suggested zooming in on a region, but that doesn’t work for blind people: there is no way for them to pinpoint the region they’re interested in, and I can’t think of a way to automatically detect which region to zoom in on.
The docs also suggest splitting large images into smaller ones, but I am worried that this might cause side effects, such as:

  1. If I happen to cut my screenshot so that the cut line goes through some text, that text might be lost, as the model won’t be able to recognize the half-cut text in either tile.
  2. General confusion of the model: if a large visual element such as a table is split across multiple tiles, the model might not view it as a single table but rather as a sequence of unrelated tables.

While these side effects are arguably not critical for my use case, I still wish I could find a more elegant solution. So I’m wondering: am I missing anything? Can my problem be solved in a better way than just splitting into smaller images? Is there at least a smart way to split into smaller images that wouldn’t cut any text in half?
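
On the last question, one workable approach is to only cut where the page has a horizontal band of whitespace. Here is a minimal sketch, assuming you can compute a per-row “ink” profile of the screenshot (e.g. counting non-background pixels per row with an image library; here it is just a plain list):

```python
def find_cut_lines(row_ink, max_tile_height, min_gap=8):
    """Choose horizontal cut positions for tiling a tall screenshot.

    row_ink[y] is the number of non-background pixels in row y. A cut is
    only placed inside a run of at least `min_gap` blank rows, so no line
    of text gets sliced in half; if no blank run exists before the height
    limit, we fall back to a hard cut at max_tile_height.
    """
    cuts = []
    start = 0
    n = len(row_ink)
    while n - start > max_tile_height:
        limit = start + max_tile_height
        cut = None
        # Scan backwards from the limit looking for a blank gap.
        y = limit
        while y > start + min_gap:
            if all(row_ink[r] == 0 for r in range(y - min_gap, y)):
                cut = y - min_gap // 2  # cut in the middle of the gap
                break
            y -= 1
        if cut is None:
            cut = limit  # no whitespace found: hard cut
        cuts.append(cut)
        start = cut
    return cuts

# Synthetic profile: two blocks of "text" separated by blank bands.
profile = [5] * 100 + [0] * 20 + [5] * 100 + [0] * 20 + [5] * 60
print(find_cut_lines(profile, max_tile_height=150, min_gap=10))  # -> [115, 235]
```

Both cuts land inside the blank bands (rows 100–119 and 220–239), so each tile contains only whole lines of content.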
Also, if OpenAI developers are reading this, may I suggest a feature request for the next version: it would be great if the model had the ability to automatically zoom in on a specific part of a giant image. I feel that would help a great deal in my use case.
Thanks!

I don’t get this issue with gpt-4o.
I have to manually resize the image to 2000px width to save some tokens.

Currently dealing with an image of 1280 × 14341.

I’ve also tried splitting it into chunks using “sharp” in Node to see if it improved performance or response time, but I’m not getting any noticeable benefit.

The API does downsizing for you.

Downsizing locally to match doesn’t save you tokens (the tile count is the same), but it does save network bandwidth.

The longer dimension is capped at 2048 pixels, and after that there is a second downsize if the shorter dimension is over 768 pixels. The exception is gpt-4.1 mini and nano, where a different second-pass downsize is used, based on token cost instead of 768px.
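
That two-pass rule can be sketched as follows (the rounding behavior is an assumption, and as noted this does not apply to gpt-4.1 mini/nano):

```python
def resized_dims(width, height):
    """Approximate dimensions after the API's high-detail preprocessing:
    1) scale down so the longest side is at most 2048 px,
    2) then scale down so the shortest side is at most 768 px.
    Aspect ratio is preserved at each step; nearest-integer rounding assumed.
    """
    # Pass 1: cap the longest side at 2048.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)
    # Pass 2: cap the shortest side at 768.
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)
    return width, height

print(resized_dims(1200, 20000))  # -> (123, 2048)
```

For the 1200x20000 screenshot discussed above, the result is a 123-pixel-wide sliver, which is why none of the text survives.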

I made an online tool to help you understand this, along with the per-model pricing. You can either specify the dimensions of an image to add to the cost calculation, or upload images or an image URL to the script.
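
For reference, once you know the post-downsize dimensions, the high-detail token cost can be estimated from the 512-px tile grid. The base and per-tile numbers below are the gpt-4o values; other models multiply them differently:

```python
import math

def high_detail_tokens(width, height, base=85, per_tile=170):
    """Estimate the token cost of a high-detail image *after* the API's
    downsizing: a flat base charge plus a charge per 512x512 tile
    (partial tiles count as whole tiles).
    """
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return base + tiles * per_tile

# The 768x2048 worst case after downsizing: 2 x 4 = 8 tiles.
print(high_detail_tokens(768, 2048))  # -> 85 + 8*170 = 1445
```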

Did you ever find a solution to your problem? I am in the same boat as you.