Hi, how do I count how many tokens each image uses with the gpt-4-vision-preview model?
According to the pricing page, every image is first resized (if too big) to fit within a 1024x1024 square, and is globally described by 85 base tokens.
Tiles
To be fully recognized, an image is covered by 512x512 tiles.
Each tile costs 170 tokens. So, by default, the formula is the following:
total tokens = 85 + 170 * n, where n is the number of tiles needed to cover your image.
Implementation
This can be computed as follows:
from math import ceil

def resize(width, height):
    # Downscale so the image fits within a 1024 x 1024 square,
    # preserving the aspect ratio.
    if width > 1024 or height > 1024:
        if width > height:
            height = int(height * 1024 / width)
            width = 1024
        else:
            width = int(width * 1024 / height)
            height = 1024
    return width, height

def count_image_tokens(width: int, height: int):
    # 85 base tokens plus 170 tokens per 512 x 512 tile.
    width, height = resize(width, height)
    h = ceil(height / 512)
    w = ceil(width / 512)
    total = 85 + 170 * h * w
    return total
Some examples
- 500x500 → one tile is enough to cover it, so total tokens = 85 + 170 = 255
- 513x500 → you need 2 tiles → total tokens = 85 + 170*2 = 425
- 513x513 → you need 4 tiles → total tokens = 85 + 170*4 = 765 (see the quick check below)
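As a quick check, here is how these examples run through count_image_tokens as defined above:

print(count_image_tokens(500, 500))  # 255: one 512x512 tile
print(count_image_tokens(513, 500))  # 425: two tiles (2 wide x 1 high)
print(count_image_tokens(513, 513))  # 765: four tiles (2 x 2)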
Low resolution mode
In “low resolution” mode, there are no tiles; only the 85 base tokens are charged, no matter the size of your image.
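A minimal sketch of this mode, assuming only the flat 85-token base cost described above (the function name is purely illustrative):

def count_image_tokens_low_detail(width: int, height: int) -> int:
    # Low-detail mode bills a flat 85 base tokens, regardless of dimensions.
    return 85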
This makes sense to me, except that when I use the calculator it seems to resize images. For example, what is going on with a 2048x2048 image: why is it resizing, and why is this 4 tiles and not 16 tiles?
They should put this in the official documentation.
Thanks!
from math import ceil

def calculate_image_tokens(width: int, height: int):
    # First, scale down to fit within a 2048 x 2048 square if necessary.
    if width > 2048 or height > 2048:
        aspect_ratio = width / height
        if aspect_ratio > 1:
            width, height = 2048, int(2048 / aspect_ratio)
        else:
            width, height = int(2048 * aspect_ratio), 2048
    # Then, scale down so that the shortest side is at most 768px.
    if width >= height and height > 768:
        width, height = int((768 / height) * width), 768
    elif height > width and width > 768:
        width, height = 768, int((768 / width) * height)
    tiles_width = ceil(width / 512)
    tiles_height = ceil(height / 512)
    total_tokens = 85 + 170 * (tiles_width * tiles_height)
    return total_tokens
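This also answers the 4-vs-16-tiles question above. A 2048x2048 image already fits within the 2048x2048 square, so the first downsize is skipped; the second step then caps the shortest side at 768, leaving a 768x768 image, which 2 x 2 = 4 tiles cover:

print(calculate_image_tokens(2048, 2048))  # 768x768 -> 4 tiles -> 85 + 170*4 = 765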
This Node.js code helped me calculate the actual tokens:
function calculateVisionPricing(width, height, detail = "high") {
  let newWidth = 768,
    newHeight = 768;
  let aspect_ratio;
  if (detail === "low") {
    return 85;
  }
  if (width > 2048 || height > 2048) {
    aspect_ratio = width / height;
    if (aspect_ratio > 1) {
      newWidth = 2048;
      newHeight = parseInt(2048 / aspect_ratio);
    } else {
      newHeight = 2048;
      newWidth = parseInt(2048 * aspect_ratio);
    }
  }
  if (width >= height && height > 768) {
    newWidth = Math.floor((768 / height) * width);
  } else if (height > width && width > 768) {
    newHeight = Math.floor((768 / width) * height);
  }
  const tiles_width = Math.ceil(newWidth / 512);
  const tiles_height = Math.ceil(newHeight / 512);
  const total_tokens = 85 + 170 * (tiles_width * tiles_height);
  return total_tokens;
}
This answer is out of date now.
Here is what I am using atm:
function calculateVisionPricing(width: number, height: number, detail: string = 'high'): number {
  if (detail === 'low') {
    return 85;
  }
  // Scale down to fit within a 2048 x 2048 square if necessary
  if (width > 2048 || height > 2048) {
    const maxSize = 2048;
    const aspectRatio = width / height;
    if (aspectRatio > 1) {
      width = maxSize;
      height = Math.floor(maxSize / aspectRatio);
    } else {
      height = maxSize;
      width = Math.floor(maxSize * aspectRatio);
    }
  }
  // Resize such that the shortest side is 768px if the original dimensions exceed 768px
  const minSize = 768;
  const aspectRatio = width / height;
  if (width > minSize && height > minSize) {
    if (aspectRatio > 1) {
      height = minSize;
      width = Math.floor(minSize * aspectRatio);
    } else {
      width = minSize;
      height = Math.floor(minSize / aspectRatio);
    }
  }
  const tilesWidth = Math.ceil(width / 512);
  const tilesHeight = Math.ceil(height / 512);
  return 85 + 170 * (tilesWidth * tilesHeight);
}

function runTests() {
  const testCases = [
    { width: 128, height: 128, detail: 'high', expected: 255 },
    { width: 512, height: 512, detail: 'high', expected: 255 },
    { width: 612, height: 134, detail: 'high', expected: 425 },
    { width: 767, height: 767, detail: 'high', expected: 765 },
    { width: 900, height: 767, detail: 'high', expected: 765 },
    { width: 900, height: 900, detail: 'high', expected: 765 },
    { width: 3000, height: 1200, detail: 'high', expected: 1445 },
    { width: 3000, height: 5000, detail: 'high', expected: 1105 },
    { width: 4096, height: 8192, detail: 'low', expected: 85 },
  ];
  let allTestsPassed = true;
  for (const test of testCases) {
    const { width, height, detail, expected } = test;
    const result = calculateVisionPricing(width, height, detail);
    const passed = result === expected;
    allTestsPassed = allTestsPassed && passed;
    console.log(`Test ${passed ? 'PASSED' : 'FAILED'}: width=${width}, height=${height}, detail=${detail}, expected=${expected}, got=${result}`);
  }
  if (allTestsPassed) {
    console.log('All tests passed!');
  } else {
    console.log('Some tests failed.');
  }
}
Here is the Python version of the above code, originally written by @avemeva:
def calculate_vision_pricing(
    width: int, height: int, detail: str = "high"
) -> int:
    if detail == "low":
        return 85
    # Scale down to fit within a 2048 x 2048 square if necessary
    if width > 2048 or height > 2048:
        max_size = 2048
        aspect_ratio = width / height
        if aspect_ratio > 1:
            width = max_size
            height = int(max_size / aspect_ratio)
        else:
            height = max_size
            width = int(max_size * aspect_ratio)
    # Resize such that the shortest side is 768px if the original dimensions exceed 768px
    min_size = 768
    aspect_ratio = width / height
    if width > min_size and height > min_size:
        if aspect_ratio > 1:
            height = min_size
            width = int(min_size * aspect_ratio)
        else:
            width = min_size
            height = int(min_size / aspect_ratio)
    tiles_width = -(-width // 512)  # Ceiling division
    tiles_height = -(-height // 512)
    return 85 + 170 * (tiles_width * tiles_height)


def run_tests():
    test_cases = [
        {"width": 128, "height": 128, "detail": "high", "expected": 255},
        {"width": 512, "height": 512, "detail": "high", "expected": 255},
        {"width": 612, "height": 134, "detail": "high", "expected": 425},
        {"width": 767, "height": 767, "detail": "high", "expected": 765},
        {"width": 900, "height": 767, "detail": "high", "expected": 765},
        {"width": 900, "height": 900, "detail": "high", "expected": 765},
        {"width": 3000, "height": 1200, "detail": "high", "expected": 1445},
        {"width": 3000, "height": 5000, "detail": "high", "expected": 1105},
        {"width": 4096, "height": 8192, "detail": "low", "expected": 85},
    ]
    all_tests_passed = True
    for test in test_cases:
        width = test["width"]
        height = test["height"]
        detail = test["detail"]
        expected = test["expected"]
        result = calculate_vision_pricing(width, height, detail)
        passed = result == expected
        all_tests_passed = all_tests_passed and passed
        print(
            f"Test {'PASSED' if passed else 'FAILED'}: width={width}, height={height}, detail={detail}, expected={expected}, got={result}"
        )
    if all_tests_passed:
        print("All tests passed!")
    else:
        print("Some tests failed.")


if __name__ == "__main__":
    run_tests()
Does anyone know how the blank space is handled in tiles that only partially cover an image? For example, from the diagrams in this blog post (OpenAI Visual Tokenizer Explained | by Tee Kai Feng | Medium), we can see that there will be “blank” space inside tiles that don't fully cover the image.
It is a good question. This is proprietary technology, so it is unlikely we will ever get a fulfilling technical answer. An open-weight OpenAI model with vision might give more clues to the encoding…if one is ever released and its technology mirrors the billing.
gpt-4.1-mini exposes small “patches” of semantic understanding (where the units do relate to perception), but we don't know whether “tiles” is really just a billing convention for underlying technology that doesn't consume the input as strict 512x512px units.
For exploration, a rectangular 500x750 image of a person on a neutral background (which bills as two tiles) was provided to several models. Even when prodded with the suggestion, the AI vision models have no concept that they are not viewing a single image, and deny seeing anything beyond the image content (i.e., blank semantic information). An example response to this prodding:
Based on the underlying AI context window, the image is presented as a single unit (tile) without subdivision into multiple sub-sections. The entire content is visible in one continuous section.
Contents of the section:
- The central area contains a person with unusual features, including a large, textured, purple object attached to their head and intricate patterns on their clothing.
- The background is a uniform off-white color, providing high contrast to the central figure.
- The figure occupies the majority of the vertical space, with the head and upper torso centered.
Overlap and blank area:
- There is no overlap between sections, as only one section (tile) is present.
- The blank area (off-white background) surrounds the figure, especially at the top, sides, and a small portion at the bottom. Approximately 25-30% of the image area is blank (background) without semantic information, mostly around the edges.
- The remaining 70-75% contains the main content (the figure and their attire).
Understanding of the entire image:
- The image is fully visible in one tile, with no missing or repeated content.
- The context window provides a complete and coherent view of the image, allowing for a clear understanding of the central subject and the background.
What is documented is the two downsize operations, using maximums for the longer and shorter image dimensions. You should do this resizing yourself, both for bandwidth reasons and so you can use a content-aware resampler, giving you more control over the final input.
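For example, a minimal pre-resize sketch with Pillow, assuming the two documented caps (2048 on the longer side, 768 on the shorter side); the file paths are hypothetical, LANCZOS merely stands in for your preferred resampler, and rounding may differ from the server's by a pixel:

from PIL import Image

def preresize(path: str, out_path: str) -> None:
    # Combine the two documented downsizes into one uniform scale:
    # fit within a 2048 x 2048 square, then cap the shortest side at 768px.
    # (Both steps are uniform scales, so a single min() reproduces them.)
    img = Image.open(path)
    w, h = img.size
    scale = min(1.0, 2048 / max(w, h), 768 / min(w, h))
    if scale < 1.0:
        img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    img.save(out_path)

preresize("cat.jpg", "cat_resized.jpg")  # hypothetical paths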
Right, I have an image (just a picture of a cat) which after the resizing and tiling is almost 50% image and 50% blank tile (it's a 2 x 3 tile image). The model does not seem to be “aware” of any padding around the image when prompted.
Anyway, thanks for your response.
Also, I'm trying to extract the location of the cat in the image (I've tried bounding boxes and grid coordinates). Sometimes GPT-4o will explicitly say that it can't do this and that I should use an object detection model instead.
It says this only sometimes, even though temperature is set to zero. I suppose setting temperature to zero doesn't give deterministic outputs with image input (floating point issues aside); I have found it does with text input.
Another possibility is that there is no padding. The tiles might overlap to cover the image.
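To make that hypothesis concrete, here is a purely speculative sketch (none of this is documented behavior) of how tile origins could be spaced along one axis so that the last tile ends exactly at the image edge, overlapping its neighbor instead of padding:

from math import ceil

def overlapping_tile_origins(length: int, tile: int = 512) -> list[int]:
    # Speculative: space ceil(length / tile) tile origins evenly so the
    # final tile ends at the image edge (overlap instead of padding).
    n = ceil(length / tile)
    if n == 1:
        return [0]
    step = (length - tile) / (n - 1)
    return [round(i * step) for i in range(n)]

print(overlapping_tile_origins(768))   # [0, 256]: two tiles overlapping by 256px
print(overlapping_tile_origins(1280))  # [0, 384, 768]: three tiles, 128px overlaps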