According to the pricing page, every image is first resized (if it is too large) to fit within a 1024x1024 square, and is described globally by 85 base tokens.
Tiles
To be fully recognized, the image is then covered by 512x512 tiles.
Each tile costs 170 tokens. So, by default, the formula is the following:
total tokens = 85 + 170 * n, where n = the number of tiles needed to cover your image.
Implementation
This can be computed as follows:
from math import ceil

def resize(width, height):
    # Scale down (keeping the aspect ratio) so the image fits in 1024x1024.
    if width > 1024 or height > 1024:
        if width > height:
            height = int(height * 1024 / width)
            width = 1024
        else:
            width = int(width * 1024 / height)
            height = 1024
    return width, height

def count_image_tokens(width: int, height: int):
    width, height = resize(width, height)
    # Number of 512x512 tiles needed along each dimension.
    h = ceil(height / 512)
    w = ceil(width / 512)
    total = 85 + 170 * h * w
    return total
Some examples
- 500x500 → 1 tile is enough to cover the image, so total tokens = 85+170 = 255
- 513x500 → you need 2 tiles → total tokens = 85+170*2 = 425
- 513x513 → you need 4 tiles → total tokens = 85+170*4 = 765
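These examples can be checked with a short, self-contained sketch. It restates the formula in compact form; the helper name `image_tokens` and the constants are chosen here for illustration only. It also adds one larger image, 2048x1024, to show the resize step taking effect before the tiles are counted.

```python
from math import ceil

MAX_SIDE = 1024    # side of the bounding square described in this post
TILE = 512         # side of one tile
BASE_TOKENS = 85   # global description of the image
TILE_TOKENS = 170  # cost of each tile

def image_tokens(width: int, height: int) -> int:
    """Hypothetical helper: 85 + 170 * n, following the rule described here."""
    longest = max(width, height)
    if longest > MAX_SIDE:
        # Scale down, keeping the aspect ratio, so the longest side is 1024.
        width = width * MAX_SIDE // longest
        height = height * MAX_SIDE // longest
    tiles = ceil(width / TILE) * ceil(height / TILE)
    return BASE_TOKENS + TILE_TOKENS * tiles

for size in [(500, 500), (513, 500), (513, 513), (2048, 1024)]:
    print(size, image_tokens(*size))
```

The first three lines print 255, 425 and 765, matching the list. The 2048x1024 image is first reduced to 1024x512, which needs 2 tiles, so it also costs 425 tokens.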
low_resolution mode
In “low resolution” mode, there are no tiles; only the 85 base tokens remain, no matter the size of your image.
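Both modes can be folded into a single function. The following is only a sketch under the rules described in this post: the function name `count_tokens` and the `detail` argument with its `"low"` / `"high"` values are assumptions made for illustration, not something taken from the pricing page.

```python
from math import ceil

def count_tokens(width: int, height: int, detail: str = "high") -> int:
    """Sketch: token cost of one image under the rules described in this post."""
    if detail == "low":
        # Low resolution: no tiles, only the base tokens, whatever the size.
        return 85
    if detail != "high":
        raise ValueError(f"unknown detail mode: {detail!r}")
    longest = max(width, height)
    if longest > 1024:
        # Fit the image in a 1024x1024 square, keeping the aspect ratio.
        width = width * 1024 // longest
        height = height * 1024 // longest
    return 85 + 170 * ceil(width / 512) * ceil(height / 512)

print(count_tokens(4000, 3000, detail="low"))   # 85
print(count_tokens(4000, 3000, detail="high"))  # 765 (1024x768 after resize, 4 tiles)
```

For a large photo the difference is substantial: the same 4000x3000 image costs 85 tokens in low resolution versus 765 with tiles.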