Limitations of GPT-4V's high res tiling process?

When using GPT-4V in high res mode on tables that are significantly bigger than 512x512, I’m finding fairly frequent row and column confusion, where the correct value is extracted but placed in the wrong row/column. However, when I condense the image to something closer to 512x512 I get great performance, as long as the text hasn’t been rescaled to the point that it’s less legible. So I’m wondering if this is a limitation of the tiling process.
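
For reference, the condensing step is just a straightforward resize before base64-encoding; a minimal sketch with Pillow (the 512px target on the long side and the LANCZOS filter are my own choices, nothing prescribed by the API):

import base64
from io import BytesIO

from PIL import Image

def downscale_for_vision(path: str, max_side: int = 512) -> str:
    """Shrink an image so its longest side is at most max_side; return base64 PNG."""
    img = Image.open(path)
    scale = max_side / max(img.size)
    if scale < 1:  # only shrink, never upscale
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")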

Is it possible to get more details on the implementation of this tiling? My concern is that for tables in particular, long-range connections are important. If a value and its corresponding row/column header land in separate tiles, and the model sees those tiles independently, it’s easy to see how that could lead to row/column confusion (depending on the exact implementation). I think this is especially true when the low-res copy has been downscaled to the point that it’s not legible.
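
For a sense of scale, here’s my rough arithmetic for how many tiles a large image turns into, based purely on the published token-pricing description for high detail (fit within 2048x2048, scale the shortest side to 768px, then count 512px tiles), not on any knowledge of the internals:

import math

def high_detail_tiles(width: int, height: int) -> int:
    """Estimate the 512px tile count in high-detail mode, per the pricing docs."""
    # 1. Fit the image within a 2048 x 2048 square, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2. Scale so the shortest side is 768px.
    scale = 768 / min(width, height)
    width, height = width * scale, height * scale
    # 3. Count the 512 x 512 tiles needed to cover the result.
    return math.ceil(width / 512) * math.ceil(height / 512)

print(high_detail_tiles(1600, 1200))  # -> 4 tiles, so a header row can easily
                                      # land in a different tile than its values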

There’s an alternate way of sending images that are larger and untiled…

user_standard_image_message = [
    {
        "role": "user",
        "content": [
            user_text_str,             # plain prompt text
            {"image": base64_image1},  # no detail settings required
            {"image": base64_image2},
        ],
    }
]

It resolves more detail than 512x512: text that can’t be read at “detail: low” becomes legible. There’s no “url” download here, only base64 data.
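
For completeness, base64_image1 / base64_image2 above are just the raw image bytes base64-encoded, and user_text_str is an ordinary prompt string; the filenames below are placeholders:

import base64

def encode_image(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

base64_image1 = encode_image("table_page_1.png")  # placeholder filename
base64_image2 = encode_image("table_page_2.png")  # placeholder filename
user_text_str = "Extract this table as CSV, preserving row and column order."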

This also reveals something interesting about how vision works: the model’s inability to “see” more than a certain amount, perhaps due to attention layer limits. Send a whole page of text through that method and it starts to hallucinate contents after a paragraph or two of verbatim OCR.