How do I calculate image tokens in GPT4 Vision?

According to the pricing page, every image is first resized (if too large) to fit within a 1024x1024 square, and its global description costs 85 base tokens.

Tiles

To be recognized in full detail, the image is then covered by 512x512 tiles.
Each tile costs 170 tokens, so by default the formula is:
total tokens = 85 + 170 * n, where n is the number of tiles needed to cover your image.

Implementation

This can be computed as follows:

from math import ceil

def resize(width: int, height: int) -> tuple[int, int]:
    # Scale the image down to fit within a 1024x1024 square,
    # preserving the aspect ratio; smaller images are left untouched.
    if width > 1024 or height > 1024:
        if width > height:
            height = int(height * 1024 / width)
            width = 1024
        else:
            width = int(width * 1024 / height)
            height = 1024
    return width, height

def count_image_tokens(width: int, height: int) -> int:
    # 85 base tokens, plus 170 tokens per 512x512 tile.
    width, height = resize(width, height)
    h = ceil(height / 512)
    w = ceil(width / 512)
    return 85 + 170 * h * w

Some examples

  • 500x500 → one tile is enough to cover it, so total tokens = 85+170 = 255
  • 513x500 → you need 2 tiles → total tokens = 85+170*2 = 425
  • 513x513 → you need 4 tiles → total tokens = 85+170*4 = 765
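As a sanity check, the function above reproduces these numbers (the logic is repeated here, condensed into one function, so the snippet runs on its own):

```python
from math import ceil

def count_image_tokens(width: int, height: int) -> int:
    # Resize to fit within 1024x1024, then count 512x512 tiles.
    if width > 1024 or height > 1024:
        if width > height:
            height = int(height * 1024 / width)
            width = 1024
        else:
            width = int(width * 1024 / height)
            height = 1024
    return 85 + 170 * ceil(height / 512) * ceil(width / 512)

print(count_image_tokens(500, 500))  # 255
print(count_image_tokens(513, 500))  # 425
print(count_image_tokens(513, 513))  # 765
```

Note that a 2048x2048 image is first resized to 1024x1024, so it costs the same 765 tokens as the 513x513 case.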

low_resolution mode

In “low resolution” mode, no tiles are used; only the 85 base tokens are charged, whatever the size of your image.
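Both modes can be handled with a single flag. The `low_resolution` parameter name below is an assumption for illustration (in the API request this corresponds to the image detail setting); a minimal sketch:

```python
from math import ceil

def count_image_tokens(width: int, height: int,
                       low_resolution: bool = False) -> int:
    # Low-resolution mode: flat 85 tokens, image size ignored.
    # (The flag name is hypothetical; the tiling logic mirrors the post above.)
    if low_resolution:
        return 85
    # High-detail mode: resize to fit within 1024x1024, then tile.
    if width > 1024 or height > 1024:
        if width > height:
            height = int(height * 1024 / width)
            width = 1024
        else:
            width = int(width * 1024 / height)
            height = 1024
    return 85 + 170 * ceil(height / 512) * ceil(width / 512)
```

For example, a 4000x3000 image costs 85 tokens in low-resolution mode, versus 85 + 170 per tile in the default mode.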
