The gpt-4-vision documentation states the following:
low will disable the “high res” model. The model will receive a low-res 512 x 512 version of the image, and represent the image with a budget of 65 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.
Does that mean that the image is resized proportionally to fit inside a 512 x 512 square, or that the image is stretched to that shape?
That is, when proportion matters (perhaps it does generally), should I resize my images to fit inside a 512 x 512 square ahead of time, or does the resizing backend take this into account?
Hi and welcome to the Developer Forum!
If you wish to resize and crop the images yourself, that’s great; if not, the system will automatically manipulate the image to deal with that.
Thanks for the reply. I understand that it will do it automatically - my question is more what is done automatically, i.e. how is the resizing/cropping performed. Is proportion taken into account?
I don’t know the exact internals, but I imagine it works like any picture viewer: if you view a very wide image on a 512x512 screen, you get lots of blank space at the top and bottom. A simple shrink to fit would seem to be appropriate.
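To make the picture-viewer analogy concrete, here is a minimal sketch of the arithmetic behind a proportional shrink to fit. It is only an illustration of the idea, not the API's actual backend: the function name shrink_to_fit, the centered padding, and the choice never to upscale small images are all my own assumptions.

```python
def shrink_to_fit(width, height, box=512):
    """Fit a width x height image inside a box x box square, keeping proportions.

    Returns (new_w, new_h, pad_x, pad_y), where pad_x / pad_y are the blank
    margins on each side if the result is centered in the square.
    Assumption: images already smaller than the box are not upscaled.
    """
    # One scale factor for both axes keeps the aspect ratio intact.
    scale = min(box / width, box / height, 1.0)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x = (box - new_w) // 2
    pad_y = (box - new_h) // 2
    return new_w, new_h, pad_x, pad_y


# A very wide image: full width is used, blank space appears above and below.
print(shrink_to_fit(2048, 512))  # (512, 128, 0, 192)
```

If you pre-resize your own images along these lines, the proportions are fixed before upload, whatever the backend does afterwards.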
I agree. But the documentation states explicitly that it is resized to 512 x 512. So I was looking for some confirmation here re. the backend.
Right, so the only algorithm that could be applied to any generic image in order to comply with that would be a proportional shrink to fit. Could it be a stretched image instead? Possibly, but the model does seem to be aware of proportionality and aspect ratio, which rules that out in my mind, leaving only shrink to fit.
My guess? It’s an OpenCV image shrink to fit call.
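For anyone wondering why awareness of aspect ratio points away from stretching, here is a small sketch comparing the two options. Again, this is just illustrative arithmetic under my own assumptions (the helper name aspect_after and the "fit" / "stretch" mode labels are made up here); it says nothing certain about what the backend really does.

```python
def aspect_after(width, height, box=512, mode="fit"):
    """Return the width/height ratio of the image content after resizing.

    mode="stretch": both sides are forced to `box`, so the ratio becomes 1.0.
    mode="fit":     one scale factor is used for both sides, so the ratio is
                    preserved (up to rounding to whole pixels).
    """
    if mode == "stretch":
        return box / box
    scale = min(box / width, box / height)
    return round(width * scale) / round(height * scale)


original = 1600 / 900
print(original)                                   # 16:9 source ratio
print(aspect_after(1600, 900, mode="fit"))        # same ratio as the source
print(aspect_after(1600, 900, mode="stretch"))    # 1.0, the ratio is lost
```

With a stretch, every input ends up square, so the model would have no way to tell a 16:9 image from a 1:1 one; with a fit, that information survives.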