https://platform.openai.com/docs/guides/vision/calculating-costs
The image you send is first downsized, if needed, so that its longest dimension is 2048 or under.
It is then downsized again so that its shortest dimension is 768 or under.
A grid of 512x512 tiles is then laid over that image. This means a “photo grid” like the one shown in the first image, at 910x910, would be downsized to 768x768. That takes up the area of four tiles, with a quarter of each dimension of the resulting 1024x1024 grid (about 44% of its area) unused or overlapped. If you were to send a 720p image (or one three times its size), that's a processed area of 1280x720 and six tiles, and unless you make a double-width image, a full 1920x1080 frame will be downsized similarly, to about 1366x768 (still six tiles).
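Putting those rules together, here is a minimal sketch of the tile and token arithmetic. The exact rounding behavior is an assumption, and the 85-base / 170-per-tile token figures are taken from the linked pricing guide as of this writing and may change:

```python
import math

# Token figures from the linked pricing docs (assumptions, subject to change).
BASE_TOKENS = 85        # flat cost per image ("low" detail, or the high-detail base)
TOKENS_PER_TILE = 170   # added per 512x512 tile in "high" detail

def resize_for_vision(width: int, height: int) -> tuple[int, int]:
    """Apply the two downsizing passes described above."""
    # Pass 1: cap the longest side at 2048.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)
    # Pass 2: cap the shortest side at 768.
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)
    return width, height

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image under the tiling scheme above."""
    if detail == "low":
        return BASE_TOKENS
    w, h = resize_for_vision(width, height)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles

# The examples from the text:
print(image_tokens(910, 910))    # 768x768   -> 4 tiles -> 765 tokens
print(image_tokens(1280, 720))   # no resize -> 6 tiles -> 1105 tokens
print(image_tokens(1920, 1080))  # ~1366x768 -> 6 tiles -> 1105 tokens
```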
So send a single 512x512 tile: you get the highest comprehension of all, with no AI confusion about which image is being referred to, no wasted padding from odd sizes, and no multi-tile blending. Then specify detail “low” to ensure there is no token overbilling.
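For reference, a minimal sketch of specifying `"detail": "low"` with the openai Python SDK; the model name and image URL are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model works
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/tile-512x512.png",  # placeholder URL
                    "detail": "low",  # fixed base token cost, no tiling
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
```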