https://platform.openai.com/docs/guides/vision/calculating-costs
The image you send is first downsized, if needed, so that its longest dimension is 2048 or under.
It is then downsized again so that its shortest dimension is 768 or under.
A grid of 512x512 tiles is then laid over that image. This means a “photo grid” like the one shown in the first image, at 910x910, would be downsized to 768x768. That takes up the area of four tiles, with a quarter of each dimension of the resulting 1024x1024 grid (about 44% of its area) unused or overlapped. If you were to send a 720p image (or one three times its size), that's a processed area of 1280x720 and six tiles, and unless you make a double-width image, a full 1920x1080 frame will be downsized similarly, to about 1366x768 (still six tiles).
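Putting those rules together, here is a minimal sketch of the tile and token arithmetic. The exact rounding behavior is an assumption, and the 85-base / 170-per-tile token figures are taken from the linked pricing guide as of this writing and may change:

```python
import math

# Token figures from the linked pricing docs (assumptions, subject to change).
BASE_TOKENS = 85        # flat cost per image ("low" detail, or the high-detail base)
TOKENS_PER_TILE = 170   # added per 512x512 tile in "high" detail

def resize_for_vision(width: int, height: int) -> tuple[int, int]:
    """Apply the two downsizing passes described above."""
    # Pass 1: cap the longest side at 2048.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)
    # Pass 2: cap the shortest side at 768.
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)
    return width, height

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image under the tiling scheme above."""
    if detail == "low":
        return BASE_TOKENS
    w, h = resize_for_vision(width, height)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles

# The examples from the text:
print(image_tokens(910, 910))    # 768x768   -> 4 tiles -> 765 tokens
print(image_tokens(1280, 720))   # no resize -> 6 tiles -> 1105 tokens
print(image_tokens(1920, 1080))  # ~1366x768 -> 6 tiles -> 1105 tokens
```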
So send a single 512x512 tile: you get the highest comprehension of all, with no AI confusion about which image is being referred to, no wasted padding from odd sizes, and no multi-tile blending. Then specify detail “low” to ensure there is no token overbilling.
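For reference, a minimal sketch of specifying `"detail": "low"` with the openai Python SDK; the model name and image URL are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model works
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/tile-512x512.png",  # placeholder URL
                    "detail": "low",  # fixed base token cost, no tiling
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
```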