Help understanding token usage with the vision API

I was trying the example from here.

This is the code:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response)
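
For the two numbers below I only changed the detail field on the image_url object (as I understand it, it defaults to "auto" when omitted), roughly like this:

response_high = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            "detail": "high",  # "low" for the low-detail run
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response_high.usage.prompt_tokens)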

The prompt_tokens for detail high is 36847 and for low it's 2846. Can anyone help me understand this? Why is there such a big difference? Thanks in advance!

I came here wondering the same thing. In my use case, the input token usage was close to 48,000 and I'm not sure why. It doesn't match what I read in the docs about vision pricing. Can someone help us understand how this works?

If you use gpt-4o-mini for vision, the reported and billed token usage for images is multiplied by a large factor (roughly 33x the token count that gpt-4o would report for the same image).

This makes image input no cheaper than simply calling gpt-4o. OpenAI may be motivated to keep image input processing on mini from being cheap and accessible, for business and competitive reasons.

Using the pricing calculator, you can see that you are actually charged about twice as much per image with mini as with gpt-4o-2024-08-06.

There is no direct documentation of this behavior other than the pricing page; the vision guide describes image tokens only in the standard way, on a page that also mentions mini, which leads many people to far underestimate the cost of sending images to the mini model.
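
To illustrate where the numbers in the first post likely come from, here is a rough back-of-the-envelope reconstruction. The base and per-tile token counts below are what the pricing calculator shows, so treat them as assumptions rather than documented behavior:

# Assumed per-image token accounting (from the pricing calculator, not the docs):
#   gpt-4o:      85 base tokens   + 170 per 512px tile
#   gpt-4o-mini: 2833 base tokens + 5667 per 512px tile  (~33x the gpt-4o values)
BASE_4O, TILE_4O = 85, 170
BASE_MINI, TILE_MINI = 2833, 5667

# After the documented high-detail resizing (fit within 2048px, then shortest
# side to 768px), the 2560px boardwalk image becomes a 3 x 2 grid of 512px tiles.
tiles = 3 * 2

low_mini = BASE_MINI                       # low detail: base cost only -> 2833
high_mini = BASE_MINI + tiles * TILE_MINI  # 2833 + 6 * 5667 = 36835
high_4o = BASE_4O + tiles * TILE_4O        # 85 + 6 * 170 = 1105

print(low_mini, high_mini, high_4o)

The reported 2846 and 36847 are these image totals plus roughly a dozen tokens for the text part of the prompt and the message framing.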


Thank you so much for explaining that! I guess I'll use the regular model for all my vision-related completions. Specifically, I see that the gpt-4o-2024-08-06 model is half the cost of the mini model per image, and the base gpt-4o model costs the same as the mini.
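
For anyone else comparing, here is the rough per-image math I used, with the per-million-token input prices listed at the time of writing (treat these prices, and the ~1105 / ~36835 image token counts from the post above, as assumptions):

# Approximate input cost for the high-detail boardwalk image on each model.
prices_per_1m = {"gpt-4o-mini": 0.15, "gpt-4o-2024-08-06": 2.50, "gpt-4o": 5.00}
image_tokens = {"gpt-4o-mini": 36835, "gpt-4o-2024-08-06": 1105, "gpt-4o": 1105}

for model, price in prices_per_1m.items():
    cost = image_tokens[model] * price / 1_000_000
    print(f"{model}: ${cost:.4f} per image")

# gpt-4o-mini:       ~$0.0055
# gpt-4o-2024-08-06: ~$0.0028 (about half of mini)
# gpt-4o:            ~$0.0055 (about the same as mini)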