We don’t know how ChatGPT’s backend preprocesses images for computer vision.
However, we do know how the API works: an image over 512 pixels in any dimension is split into tiles, and the model then reads the main (overview) tile and processes each of the subtiles.
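As a rough illustration of that tile math (this is just my reading of the behaviour described above, not OpenAI’s published algorithm), here is how you’d count the subtiles an image would produce:

```python
import math

def tile_grid(width: int, height: int, tile: int = 512):
    # Sketch only: assumes the image is cut into a simple grid of 512-px tiles.
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return cols, rows, cols * rows  # columns, rows, total subtiles

# e.g. a 768 x 1024 page render -> 2 x 2 = 4 subtiles, plus the main overview tile
print(tile_grid(768, 1024))
```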
Here is an example: a high-quality PDF-to-image rendering produced with Adobe tools, at the maximum size the API will allow (only 768px wide), with the API tile size marked in red (although the actual tiles may be divided more evenly).
That tiling may add to the confusion, along with the ultimately low resolution. Using GPT-4-vision for OCR is a poor use of the AI on a nearly-solved problem.
Techniques:
- try at a maximum of 512 pixels in either dimension to avoid tiling (see the first sketch after this list)
- try with slices, cutting a page into smaller strips of text (see the second sketch after this list).
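A minimal Pillow sketch of the first technique, downscaling so neither dimension exceeds 512px (the file names here are just placeholders):

```python
from PIL import Image

def fit_under_512(path: str, out_path: str) -> None:
    # Shrink in place so neither side exceeds 512 px; per the tiling behaviour
    # described above, this should keep the API from splitting the image.
    img = Image.open(path)
    img.thumbnail((512, 512), Image.LANCZOS)  # preserves aspect ratio, only shrinks
    img.save(out_path)

fit_under_512("page.png", "page_512.png")
```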
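And a similar sketch of the slicing approach, cutting a tall page render into horizontal strips that are each sent as their own image (the 512px strip height is just a starting point to experiment with):

```python
from PIL import Image

def slice_page(path: str, slice_height: int = 512):
    # Cut the page into horizontal strips; the last strip may be shorter.
    img = Image.open(path)
    w, h = img.size
    return [img.crop((0, top, w, min(top + slice_height, h)))
            for top in range(0, h, slice_height)]

for i, strip in enumerate(slice_page("page.png")):
    strip.save(f"page_strip_{i}.png")
```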