Using the Vision API: best practices

With “detail”: “low”, as the documentation informs us, the model receives the image resized down to 512px on its longest dimension, and no additional high-resolution “tiles” of image regions are overlaid on top of that.
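A minimal sketch of how the low-detail setting is requested in a Chat Completions message; the image URL and prompt text here are placeholders:

```python
import json

def vision_message(image_url: str, prompt: str) -> dict:
    """Build a user message whose image part is pinned to "detail": "low"."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # "detail": "low" caps the image cost at 85 tokens
                "image_url": {"url": image_url, "detail": "low"},
            },
        ],
    }

msg = vision_message("https://example.com/receipt.png", "Transcribe the text.")
print(json.dumps(msg, indent=2))
```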

Resize your images to that size yourself and check whether the details you care about are still legible.
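To preview what the model will actually see, you can compute the downscaled dimensions yourself; this sketch assumes the 512px longest-side figure from the documentation, and the resulting size can be fed to any image library's resize call:

```python
def low_detail_size(width: int, height: int, target: int = 512) -> tuple[int, int]:
    """Scale (width, height) so the longest side is `target` px,
    preserving aspect ratio; images already small enough are untouched."""
    longest = max(width, height)
    if longest <= target:
        return width, height  # never upscale
    scale = target / longest
    return round(width * scale), round(height * scale)

print(low_detail_size(2048, 1536))  # → (512, 384)
print(low_detail_size(400, 300))    # → (400, 300), left as-is
```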

With raw access to send an image of any size to be encoded into 85 tokens of “low”, one quickly sees that information theory holds: the amount of text the model can faithfully reproduce from the image is fewer tokens than that before the hallucinations start.