How do I calculate image tokens in GPT4 Vision?

Hi, how do I count how many tokens each image costs when using the gpt-4-vision-preview model?

According to the pricing page, every image first carries a fixed base cost of 85 tokens.

Tiles

To be fully processed, the image is covered with 512x512 tiles.
Each tile adds 170 tokens, so by default the formula is:
total tokens = 85 + 170 * n, where n is the number of tiles needed to cover your image.

Implementation

This can be computed as follows:

from math import ceil

def count_image_tokens(width: int, height: int) -> int:
    # number of 512x512 tiles needed along each axis
    w = ceil(width / 512)
    h = ceil(height / 512)
    n = w * h  # total tiles covering the image
    total = 85 + 170 * n  # base cost + per-tile cost
    return total

or in one line if you prefer:

count_total_tokens = lambda w, h: 85 + 170 * ceil(w / 512) * ceil(h / 512)

Some examples

  • 500x500 → 1 tile is enough to cover it, so total tokens = 85 + 170 = 255
  • 513x500 → you need 2 tiles → total tokens = 85 + 170*2 = 425
  • 513x513 → you need 4 tiles → total tokens = 85 + 170*4 = 765
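
As a quick check, the function above reproduces these values:

print(count_image_tokens(500, 500))  # 255
print(count_image_tokens(513, 500))  # 425
print(count_image_tokens(513, 513))  # 765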

Low-resolution mode

In low-resolution mode (the detail parameter set to "low"), there are no tiles; only the 85 base tokens are charged, no matter the size of your image.
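
For reference, here is a minimal sketch of requesting low-resolution processing, assuming the official openai Python client (the image URL is a placeholder):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                # "detail": "low" caps the image cost at the 85 base tokens
                "image_url": {"url": "https://example.com/image.png", "detail": "low"},
            },
        ],
    }],
)
print(response.usage.prompt_tokens)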


This makes sense to me, except that the calculator seems to be resizing images.

For example, what is going on with a 2048x2048 image?

Why is it resizing, and why is that 4 tiles and not 16?
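
The resizing comes from documented preprocessing that the simple formula above skips: in high-detail mode, the image is first scaled to fit within a 2048x2048 square (keeping its aspect ratio), then scaled so that its shortest side is 768 px, and only then tiled. A sketch of the full calculation under that documented behavior, assuming small images are never upscaled:

from math import ceil

def count_image_tokens_high_detail(width: int, height: int) -> int:
    # Step 1: fit within a 2048x2048 square, preserving aspect ratio
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # Step 2: scale so the shortest side is 768 px (downscale only)
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    # Step 3: count 512x512 tiles over the resized image
    n = ceil(width / 512) * ceil(height / 512)
    return 85 + 170 * n

# A 2048x2048 image is resized to 768x768, which needs 2x2 = 4 tiles:
print(count_image_tokens_high_detail(2048, 2048))  # 85 + 170*4 = 765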