Help understanding token usage with the vision API

I was trying the example from here.

This is the code:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response)
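
For the two numbers below I only changed the detail field on the image_url object (as I understand it, it defaults to "auto" when omitted), roughly like this:

response_high = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            "detail": "high",  # "low" for the low-detail run
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response_high.usage.prompt_tokens)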

The prompt_tokens for detail high is 36847 and for low it's 2846. Can anyone help me understand this? Why is there such a big difference? Thanks in advance!

I came here wondering the same thing. In my use case, the input token usage was close to 48,000 and I'm not sure why. It doesn't match what I read in the docs about vision pricing. Can someone help us understand how this works?

If you use gpt-4o-mini for vision, the reported and billed token usage for images is multiplied by a large factor (roughly 33x the token count that gpt-4o would report for the same image).

This makes image input no cheaper than simply calling gpt-4o. OpenAI may be motivated to keep image input processing on mini from being cheap and accessible, for business and competitive reasons.

Using the pricing calculator, you can see that you are actually charged about twice as much per image with mini as with gpt-4o-2024-08-06.

There is no direct documentation of this behavior other than the pricing page; the vision guide describes image tokens only in the standard way, on a page that also mentions mini, which leads many people to far underestimate the cost of sending images to the mini model.
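
To illustrate where the numbers in the first post likely come from, here is a rough back-of-the-envelope reconstruction. The base and per-tile token counts below are what the pricing calculator shows, so treat them as assumptions rather than documented behavior:

# Assumed per-image token accounting (from the pricing calculator, not the docs):
#   gpt-4o:      85 base tokens   + 170 per 512px tile
#   gpt-4o-mini: 2833 base tokens + 5667 per 512px tile  (~33x the gpt-4o values)
BASE_4O, TILE_4O = 85, 170
BASE_MINI, TILE_MINI = 2833, 5667

# After the documented high-detail resizing (fit within 2048px, then shortest
# side to 768px), the 2560px boardwalk image becomes a 3 x 2 grid of 512px tiles.
tiles = 3 * 2

low_mini = BASE_MINI                       # low detail: base cost only -> 2833
high_mini = BASE_MINI + tiles * TILE_MINI  # 2833 + 6 * 5667 = 36835
high_4o = BASE_4O + tiles * TILE_4O        # 85 + 6 * 170 = 1105

print(low_mini, high_mini, high_4o)

The reported 2846 and 36847 are these image totals plus roughly a dozen tokens for the text part of the prompt and the message framing.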


Thank you so much for explaining that! I guess I'll use the regular model for all my vision-related completions. Specifically, I see that the gpt-4o-2024-08-06 model is half the cost of the mini model per image, and the base gpt-4o model costs the same as the mini.
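
For anyone else comparing, here is the rough per-image math I used, with the per-million-token input prices listed at the time of writing (treat these prices, and the ~1105 / ~36835 image token counts from the post above, as assumptions):

# Approximate input cost for the high-detail boardwalk image on each model.
prices_per_1m = {"gpt-4o-mini": 0.15, "gpt-4o-2024-08-06": 2.50, "gpt-4o": 5.00}
image_tokens = {"gpt-4o-mini": 36835, "gpt-4o-2024-08-06": 1105, "gpt-4o": 1105}

for model, price in prices_per_1m.items():
    cost = image_tokens[model] * price / 1_000_000
    print(f"{model}: ${cost:.4f} per image")

# gpt-4o-mini:       ~$0.0055
# gpt-4o-2024-08-06: ~$0.0028 (about half of mini)
# gpt-4o:            ~$0.0055 (about the same as mini)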