How do I calculate image tokens in GPT4 Vision?

Hi, how do I count how many tokens each image uses with the gpt-4-vision-preview model?


According to the pricing page, every image is first resized (if too big) to fit within a 1024x1024 square, and is described globally by 85 base tokens.

Tiles

To be fully recognized, an image is covered by 512x512 tiles.
Each tile provides 170 tokens. So, by default, the formula is the following:
total tokens = 85 + 170 * n, where n = the number of tiles needed to cover your image.

Implementation

This can be computed as follows:

from math import ceil

def resize(width, height):
    # Scale down so the longest side is at most 1024px, preserving aspect ratio
    if width > 1024 or height > 1024:
        if width > height:
            height = int(height * 1024 / width)
            width = 1024
        else:
            width = int(width * 1024 / height)
            height = 1024
    return width, height

def count_image_tokens(width: int, height: int):
    width, height = resize(width, height)
    # Number of 512x512 tiles needed to cover the resized image
    h = ceil(height / 512)
    w = ceil(width / 512)
    total = 85 + 170 * h * w
    return total

Some examples

  • 500x500 → 1 tile is enough to cover it, so total tokens = 85+170 = 255
  • 513x500 → you need 2 tiles → total tokens = 85+170*2 = 425
  • 513x513 → you need 4 tiles → total tokens = 85+170*4 = 765
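To sanity-check these examples, here is a compact version of the same formula (using the same 1024px-clamp assumption as above):

```python
from math import ceil

def tokens(width, height):
    # Clamp the longest side to 1024px, as described above (assumption)
    if max(width, height) > 1024:
        scale = 1024 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # 85 base tokens plus 170 per covering 512x512 tile
    return 85 + 170 * ceil(width / 512) * ceil(height / 512)

print(tokens(500, 500))  # 255
print(tokens(513, 500))  # 425
print(tokens(513, 513))  # 765
```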

Low resolution mode

In “low resolution” mode, there are no tiles; only the 85 base tokens are charged, no matter the size of your image.


This makes sense to me, except that when you use the calculator, it seems to resize images.

Like what is going on here in a 2048x2048 image:

Why is it resizing, and why is this 4 tiles and not 16 tiles?


See: https://platform.openai.com/docs/guides/vision/calculating-costs
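Working through the linked algorithm for the 2048x2048 case explains the "4 tiles, not 16" result. A sketch of the two documented steps (clamp to a 2048px square, then scale the shortest side down to 768px):

```python
from math import ceil

def docs_tokens(width, height):
    # Step 1: fit within a 2048x2048 square, preserving aspect ratio
    if max(width, height) > 2048:
        if width >= height:
            width, height = 2048, int(height * 2048 / width)
        else:
            width, height = int(width * 2048 / height), 2048
    # Step 2: scale so the shortest side is 768px, if it exceeds 768px
    if min(width, height) > 768:
        if width >= height:
            width, height = int(width * 768 / height), 768
        else:
            width, height = 768, int(height * 768 / width)
    # 85 base tokens plus 170 per covering 512x512 tile
    return 85 + 170 * ceil(width / 512) * ceil(height / 512)

print(docs_tokens(2048, 2048))  # 765
```

A 2048x2048 image passes step 1 untouched, but step 2 shrinks it to 768x768, which needs only 2x2 = 4 tiles: 85 + 170*4 = 765 tokens.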


They should put this in official documentation

Thanks!

from math import ceil

def calculate_image_tokens(width: int, height: int):
    # Scale down to fit within a 2048x2048 square if necessary
    if width > 2048 or height > 2048:
        aspect_ratio = width / height
        if aspect_ratio > 1:
            width, height = 2048, int(2048 / aspect_ratio)
        else:
            width, height = int(2048 * aspect_ratio), 2048

    # Scale down so the shortest side is 768px if it exceeds 768px
    if width >= height and height > 768:
        width, height = int((768 / height) * width), 768
    elif height > width and width > 768:
        width, height = 768, int((768 / width) * height)

    tiles_width = ceil(width / 512)
    tiles_height = ceil(height / 512)
    total_tokens = 85 + 170 * (tiles_width * tiles_height)
    
    return total_tokens

This Node.js code helped me calculate the actual tokens:

function calculateVisionPricing(width, height, detail = "high") {
  if (detail === "low") {
    return 85;
  }

  let newWidth = width;
  let newHeight = height;

  // Scale down to fit within a 2048x2048 square if necessary
  if (newWidth > 2048 || newHeight > 2048) {
    const aspectRatio = newWidth / newHeight;
    if (aspectRatio > 1) {
      newWidth = 2048;
      newHeight = Math.floor(2048 / aspectRatio);
    } else {
      newHeight = 2048;
      newWidth = Math.floor(2048 * aspectRatio);
    }
  }

  // Scale down so the shortest side is 768px if it exceeds 768px
  if (newWidth > 768 && newHeight > 768) {
    if (newWidth >= newHeight) {
      newWidth = Math.floor((768 / newHeight) * newWidth);
      newHeight = 768;
    } else {
      newHeight = Math.floor((768 / newWidth) * newHeight);
      newWidth = 768;
    }
  }

  const tiles_width = Math.ceil(newWidth / 512);
  const tiles_height = Math.ceil(newHeight / 512);
  const total_tokens = 85 + 170 * (tiles_width * tiles_height);
  return total_tokens;
}

This answer is out of date now

Here is what I am using atm:

function calculateVisionPricing(width: number, height: number, detail: string = 'high'): number {
  if (detail === 'low') {
    return 85;
  }

  // Scale down to fit within a 2048 x 2048 square if necessary
  if (width > 2048 || height > 2048) {
    const maxSize = 2048;
    const aspectRatio = width / height;
    if (aspectRatio > 1) {
      width = maxSize;
      height = Math.floor(maxSize / aspectRatio);
    } else {
      height = maxSize;
      width = Math.floor(maxSize * aspectRatio);
    }
  }

  // Resize such that the shortest side is 768px if the original dimensions exceed 768px
  const minSize = 768;
  const aspectRatio = width / height;
  if (width > minSize && height > minSize) {
    if (aspectRatio > 1) {
      height = minSize;
      width = Math.floor(minSize * aspectRatio);
    } else {
      width = minSize;
      height = Math.floor(minSize / aspectRatio);
    }
  }

  const tilesWidth = Math.ceil(width / 512);
  const tilesHeight = Math.ceil(height / 512);
  return 85 + 170 * (tilesWidth * tilesHeight);
}

function runTests() {
  const testCases = [
    { width: 128, height: 128, detail: 'high', expected: 255 },
    { width: 512, height: 512, detail: 'high', expected: 255 },
    { width: 612, height: 134, detail: 'high', expected: 425 },
    { width: 767, height: 767, detail: 'high', expected: 765 },
    { width: 900, height: 767, detail: 'high', expected: 765 },
    { width: 900, height: 900, detail: 'high', expected: 765 },
    { width: 3000, height: 1200, detail: 'high', expected: 1445 },
    { width: 3000, height: 5000, detail: 'high', expected: 1105 },
    { width: 4096, height: 8192, detail: 'low', expected: 85 },
  ];

  let allTestsPassed = true;

  for (const test of testCases) {
    const { width, height, detail, expected } = test;
    const result = calculateVisionPricing(width, height, detail);
    const passed = result === expected;
    allTestsPassed = allTestsPassed && passed;
    console.log(`Test ${passed ? 'PASSED' : 'FAILED'}: width=${width}, height=${height}, detail=${detail}, expected=${expected}, got=${result}`);
  }

  if (allTestsPassed) {
    console.log('All tests passed!');
  } else {
    console.log('Some tests failed.');
  }
}

runTests();

Here is the Python version of the above code, originally written by @avemeva:

def calculate_vision_pricing(
    width: int, height: int, detail: str = "high"
) -> int:
    if detail == "low":
        return 85

    # Scale down to fit within a 2048 x 2048 square if necessary
    if width > 2048 or height > 2048:
        max_size = 2048
        aspect_ratio = width / height
        if aspect_ratio > 1:
            width = max_size
            height = int(max_size / aspect_ratio)
        else:
            height = max_size
            width = int(max_size * aspect_ratio)

    # Resize such that the shortest side is 768px if the original dimensions exceed 768px
    min_size = 768
    aspect_ratio = width / height
    if width > min_size and height > min_size:
        if aspect_ratio > 1:
            height = min_size
            width = int(min_size * aspect_ratio)
        else:
            width = min_size
            height = int(min_size / aspect_ratio)

    tiles_width = -(-width // 512)  # Ceiling division
    tiles_height = -(-height // 512)
    return 85 + 170 * (tiles_width * tiles_height)


def run_tests():
    test_cases = [
        {"width": 128, "height": 128, "detail": "high", "expected": 255},
        {"width": 512, "height": 512, "detail": "high", "expected": 255},
        {"width": 612, "height": 134, "detail": "high", "expected": 425},
        {"width": 767, "height": 767, "detail": "high", "expected": 765},
        {"width": 900, "height": 767, "detail": "high", "expected": 765},
        {"width": 900, "height": 900, "detail": "high", "expected": 765},
        {"width": 3000, "height": 1200, "detail": "high", "expected": 1445},
        {"width": 3000, "height": 5000, "detail": "high", "expected": 1105},
        {"width": 4096, "height": 8192, "detail": "low", "expected": 85},
    ]

    all_tests_passed = True

    for test in test_cases:
        width = test["width"]
        height = test["height"]
        detail = test["detail"]
        expected = test["expected"]
        result = calculate_vision_pricing(width, height, detail)
        passed = result == expected
        all_tests_passed = all_tests_passed and passed
        print(
            f"Test {'PASSED' if passed else 'FAILED'}: width={width}, height={height}, detail={detail}, expected={expected}, got={result}"
        )

    if all_tests_passed:
        print("All tests passed!")
    else:
        print("Some tests failed.")


if __name__ == "__main__":
    run_tests()

Does anyone know how the blank space is handled in tiles that only partially cover an image? For example, from the diagrams in this blog post (OpenAI Visual Tokenizer Explained | by Tee Kai Feng | Medium) we can see that there will be “blank” space inside tiles which don’t fully cover the image.


It is a good question. This is proprietary technology, so a full technical answer is unlikely to ever be offered. An open-weight OpenAI model with vision might give more clues about the encoding…if one is ever released and its technology mirrors the billing.

gpt-4.1-mini exposes small “patches” of semantic understanding (where the units do relate to perception), but we don’t know whether “tiles” is really just a billing convention for an underlying technology that doesn’t actually consume the input as exact 512x512px units.

For exploration, a rectangular 500x750 image (a person on a neutral background) was provided to several multi-tile models. Even when prodded with suggestion, the vision models have no concept that anything other than a single image is being viewed, and deny seeing more than the image content (i.e., no blank semantic information). An example response to this prodding:

Based on the underlying AI context window, the image is presented as a single unit (tile) without subdivision into multiple sub-sections. The entire content is visible in one continuous section.

Contents of the section:

  • The central area contains a person with unusual features, including a large, textured, purple object attached to their head and intricate patterns on their clothing.
  • The background is a uniform off-white color, providing high contrast to the central figure.
  • The figure occupies the majority of the vertical space, with the head and upper torso centered.

Overlap and blank area:

  • There is no overlap between sections, as only one section (tile) is present.
  • The blank area (off-white background) surrounds the figure, especially at the top, sides, and a small portion at the bottom. Approximately 25-30% of the image area is blank (background) without semantic information, mostly around the edges.
  • The remaining 70-75% contains the main content (the figure and their attire).

Understanding of the entire image:

  • The image is fully visible in one tile, with no missing or repeated content.
  • The context window provides a complete and coherent view of the image, allowing for a clear understanding of the central subject and the background.

What is documented is the two downsize operations, using maximums for the longer and shorter image dimensions. You should do this resizing yourself, for bandwidth reasons and with a content-aware resampler, so that you control more of the final input.

Right, I have an image (just a picture of a cat) which, after resizing and tiling, is almost 50% image and 50% blank tile (it’s a 2-by-3-tile image). The model does not seem to be “aware” of any padding around the image when prompted.

Anyway, thanks for your response.

Also, I’m trying to extract the location of the cat in the image (I’ve tried bounding boxes and grid coordinates). Sometimes GPT-4o will explicitly say that it can’t do this and that I should use an object detection model instead.

It says this only sometimes, even though temperature is set to zero. I suppose setting temperature to zero doesn’t give deterministic outputs with image input (floating-point issues aside); I have found that it does with text input.

Another possibility is that there is no padding. The tiles might overlap to cover the image.
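If the tiles are zero-padded rather than overlapped, the blank fraction for a given resized image is easy to estimate. A sketch, assuming edge-to-edge 512px tiles with padding (which, as noted above, is only one possibility; overlapping tiles would change this):

```python
from math import ceil

def blank_fraction(width, height, tile=512):
    # Assumes tiles are laid edge-to-edge and the remainder is zero-padded
    tiles_w = ceil(width / tile)
    tiles_h = ceil(height / tile)
    covered = (tiles_w * tile) * (tiles_h * tile)
    return 1 - (width * height) / covered

# e.g. a 768x1280 image tiled 2x3 (2048px and 768px rules already applied):
print(round(blank_fraction(768, 1280), 3))  # 0.375
```

So under a zero-padding assumption, a 2-by-3-tile image like the cat example above can indeed be well over a third blank, even though the model reports seeing nothing but the image content.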