I am experimenting with the GPT-4 Vision preview API. For typical images under 1 MB, responses come back reasonably quickly, but images over 3 MB take an average of 2 minutes. Is there any way to speed this up?
Their system rescales the image twice: first so that the longest dimension is at most 2048 pixels, and then again so that the shortest side is at most 768 pixels.
For a 3:2 camera photo, that gives a “vision” resolution of 1152 x 768, which is covered by 6 tiles of 512 x 512 (3 across, 2 down), plus one extra 512 x 512 tile for the low-resolution overview of the whole image.
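To check the arithmetic for your own images, here is a minimal sketch of the two scaling steps described above. It assumes the steps only ever shrink an image (never enlarge it) and that tiles are counted by rounding each side up to the next multiple of 512; the function name `vision_tiles` is hypothetical, not part of any OpenAI library. The count it returns covers the detail tiles only, not the extra overview tile.

```python
import math


def vision_tiles(width: int, height: int) -> tuple[int, int, int]:
    """Hypothetical helper: return (scaled_width, scaled_height, detail_tiles).

    Assumes the server-side scaling described above:
    1. shrink so the longest side is at most 2048 px,
    2. shrink again so the shortest side is at most 768 px,
    then count 512 x 512 tiles, rounding partial tiles up.
    """
    # Step 1: fit the longest side within 2048 px (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: fit the shortest side within 768 px (downscale only).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Each partially covered 512 px block still counts as a whole tile.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return w, h, tiles


print(vision_tiles(6000, 4000))  # a 3:2 photo from a 24 MP camera
```

For the 6000 x 4000 example this lands on 1152 x 768 and 6 detail tiles, matching the figures above.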
You could test how an image fares when you resize it on your end so that its longest side is at most 1024 pixels, which yields at most 4 tiles. Or, for the ultimate “fast” option, resize it to 512 x 512, which is a single tile.
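A minimal sketch of the client-side target size for that resize, assuming you want to keep the aspect ratio rather than squash the photo into a square. The helper name `fit_longest_side` is hypothetical; it only computes the dimensions and does not touch any image data.

```python
def fit_longest_side(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Hypothetical helper: shrink (never enlarge) so the longer side is at most
    max_side pixels, preserving the aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    # Round to whole pixels and never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_longest_side(6000, 4000, 1024))  # target for the "max 4 tiles" option
print(fit_longest_side(6000, 4000, 512))   # target for the single-tile option
```

A 6000 x 4000 photo becomes 1024 x 683 (4 tiles) or 512 x 341 (1 tile). If you use Pillow, `Image.thumbnail((1024, 1024))` performs an equivalent in-place, aspect-preserving downscale; re-encoding the result as JPEG also shrinks the file you upload.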