from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response)
The prompt_tokens for detail high is 36847 and for low it's 2846. Can anyone help me understand it? Why such a high difference? Thanks in advance!
I came here wondering the same thing. In my use case, the input token usage was close to 48,000 and I’m not sure why. It doesn’t match what I read on the docs about vision pricing. Can someone help us understand how this works?
If you use gpt-4o-mini for vision, the reported and billed token usage for images is multiplied by a large scalar.
This makes images no cheaper than calling gpt-4o. OpenAI may be motivated to keep image input processing from being cheap and accessible for business and competitive reasons.
Using the pricing calculator, you can see you are actually charged about twice as much per image with mini.
There is no direct documentation of this behavior apart from the pricing page; the vision guide describes image tokens only in the standard way, on a page that also mentions mini, leading many to far underestimate the cost of sending images to the mini model.
Thank you so much for explaining that! I guess I'll use the regular model for all my vision-related completions. Specifically, I see that the gpt-4o-2024-08-06 model is half the cost of the mini model for images, while the base gpt-4o model costs the same as mini.
Yes, I also found that the image cost for GPT-4o-mini is twice that of GPT-4o.
What's more, the API reports the same usage for both models, but the price calculator actually uses 5667 tokens per tile plus a base of 2833 tokens for mini.
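For what it's worth, those mini figures look like gpt-4o's standard image accounting (85 base tokens plus 170 per 512px tile, per OpenAI's vision guide) scaled by a factor of roughly 33.33. This is a community observation, not something OpenAI documents; a quick sanity check:

```python
# gpt-4o's documented image accounting: 85 base tokens + 170 per 512px tile.
# The mini figures quoted above (2833 base, 5667 per tile) look like those
# same numbers scaled by ~33.33 -- an observation, not official docs.
GPT4O_BASE_TOKENS = 85
GPT4O_TILE_TOKENS = 170

print(round(2833 / GPT4O_BASE_TOKENS, 2))  # 33.33
print(round(5667 / GPT4O_TILE_TOKENS, 2))  # 33.34
```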
This is far too expensive, especially compared to other services:
Gemini: Here’s how tokens are calculated for images:
Gemini 1.0 Pro Vision: Each image accounts for 258 tokens.
Gemini 1.5 Flash and Gemini 1.5 Pro: If both dimensions of an image are less than or equal to 384 pixels, then 258 tokens are used. If one dimension of an image is greater than 384 pixels, then the image is cropped into tiles. Each tile size defaults to the smallest dimension (width or height) divided by 1.5. If necessary, each tile is adjusted so that it’s not smaller than 256 pixels and not greater than 768 pixels. Each tile is then resized to 768x768 and uses 258 tokens.
Gemini 2.0 Flash: Image inputs with both dimensions <=384 pixels are counted as 258 tokens. Images larger in one or both dimensions are cropped and scaled as needed into tiles of 768x768 pixels, each counted as 258 tokens.
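The Gemini 1.5 tiling rule quoted above can be sketched in a few lines. This is an approximation of the documented behavior, not Google's actual tokenizer code:

```python
import math

def gemini_15_image_tokens(width, height):
    # Sketch of the Gemini 1.5 Flash/Pro rule quoted above:
    # small images are flat-rate; larger ones are cut into tiles.
    TOKENS_PER_TILE = 258
    if width <= 384 and height <= 384:
        return TOKENS_PER_TILE
    # Tile size defaults to the smaller dimension / 1.5,
    # clamped to the [256, 768] px range.
    tile = min(width, height) / 1.5
    tile = max(256.0, min(768.0, tile))
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * TOKENS_PER_TILE

print(gemini_15_image_tokens(300, 300))    # 258 (flat rate)
print(gemini_15_image_tokens(2304, 2304))  # 9 tiles -> 2322
```

Note that a 2304x2304 input lands on exactly 9 tiles, which matches the maximum-cost figure discussed later in this thread.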
Anthropic: Calculate image costs
Each image you include in a request to Claude counts towards your token usage. To calculate the approximate cost, multiply the approximate number of image tokens by the [per-token price of the model] you’re using.
If your image does not need to be resized, you can estimate the number of tokens used through this algorithm: tokens = (width px * height px)/750
Here are examples of approximate tokenization and costs for different image sizes within our API’s size constraints based on Claude 3.5 Sonnet per-token price of $3 per million input tokens:
| Image size | # of Tokens | Cost / image | Cost / 1K images |
| --- | --- | --- | --- |
| 200x200 px (0.04 megapixels) | ~54 | ~$0.00016 | ~$0.16 |
| 1000x1000 px (1 megapixel) | ~1334 | ~$0.004 | ~$4.00 |
| 1092x1092 px (1.19 megapixels) | ~1590 | ~$0.0048 | ~$4.80 |
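The table rows follow directly from the (width px * height px) / 750 rule. A minimal sketch, assuming the Claude 3.5 Sonnet input price of $3 per million tokens quoted above:

```python
def claude_image_tokens(width, height):
    # Anthropic's documented approximation for images that are
    # not resized by the API: tokens = (width * height) / 750.
    return width * height / 750

def claude_image_cost(width, height, price_per_mtok=3.00):
    # price_per_mtok: input price in $ per million tokens
    # (Claude 3.5 Sonnet, per the table above).
    return claude_image_tokens(width, height) * price_per_mtok / 1e6

print(round(claude_image_tokens(1000, 1000)))   # ~1333 tokens
print(round(claude_image_cost(1000, 1000), 4))  # ~$0.004 per image
```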
OpenAI, please think about lower prices for the vision capabilities.
Of course it depends on what you are planning. I found that Gemini works quite well for interpreting text in images; I cannot say how good it is at other tasks.
But it is incredibly cheap. It has a free tier (with a low rate limit, and which uses your input for training the model) and a paid tier (which does not use your data for training). The most modern one is currently (if I am correct) Gemini 2.0 Flash, at $0.10 for input and $0.40 for output per 1M tokens. Gemini 1.5 Pro is a bit older (but larger) at $1.25 and $5.00 per 1M tokens. There are even cheaper models than those available.
So for input it accepts images up to 2304x2304, which results in 9 tiles. Each tile costs 258 tokens, so an image costs at most 2322 tokens (usually around 1300 tokens for a 2:3 image).
For input, that comes to at most $0.0002322 on 2.0 Flash and $0.0029025 on 1.5 Pro (the average will be around half that). So 1,000 images cost at most $0.23 or $2.90.
Compared with GPT-4o mini: its maximum image size is 768x2048, which results in 8 tiles x 5667 tokens + 2833 base tokens = 48,169 tokens, i.e. $0.00722535 per image, about 31x more expensive than Gemini 2.0 Flash.
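The arithmetic above can be checked in a few lines. Prices are as quoted in this thread; gpt-4o-mini's $0.15 per 1M input tokens is assumed from the pricing page at the time of writing, so verify before relying on it:

```python
# Per-token input prices as quoted in this thread (assumed, may change):
MINI_PRICE = 0.15 / 1_000_000    # gpt-4o-mini, $ per input token
FLASH_PRICE = 0.10 / 1_000_000   # Gemini 2.0 Flash, $ per input token

mini_tokens = 8 * 5667 + 2833    # 8 tiles + base tokens = 48169
gemini_tokens = 9 * 258          # 9 tiles of 258 tokens = 2322

mini_cost = mini_tokens * MINI_PRICE       # $0.00722535 per image
gemini_cost = gemini_tokens * FLASH_PRICE  # $0.0002322 per image
print(round(mini_cost / gemini_cost))      # 31
```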