Unexpectedly High Token Count When Using Image Inputs with gpt-4o-mini

Hi everyone,

I came across something odd when testing the image input capabilities via the API. According to the OpenAI documentation, image token usage should be reasonable and roughly proportional to the image processing effort. However, when I ran the example curl command from the docs (a single image), I got a total token usage of 36,912 tokens.

Here is the exact request I used (copied from the docs):

curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "what is in this image?"},
          {
            "type": "input_image",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
          }
        ]
      }
    ]
  }'

And here’s the relevant part of the response:

"usage": {
  "input_tokens": 36848,
  "output_tokens": 64,
  "total_tokens": 36912
}

The output from the model was a short, simple description of the image, so the output_tokens value makes sense. But nearly 37k input tokens for a single 2560 × 1669 px image seems excessive.

Is this expected behavior? Could it be a bug or miscalculation in the token estimation for image inputs?

Would love to hear if others are seeing the same.

Thanks!

You have discovered the way it works: gpt-4o-mini applies a large multiplier (roughly 33×) to image tokens, so an image ends up costing about the same in dollars as it does on gpt-4o despite the much lower per-token price. This is described in the vision section of the docs.
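For reference, here is a rough sketch (in Python) of how that number can be reproduced from the tile-based formula in the vision docs. The constants (2048 px fit, 768 px shortest side, 512 px tiles, and the gpt-4o-mini base/per-tile token values of 2833/5667) are my reading of the current docs, so treat them as assumptions rather than a guaranteed pricing contract:

```python
import math

def image_tokens_4o_mini(width, height, base=2833, per_tile=5667):
    """Estimate input tokens for one high-detail image on gpt-4o-mini.

    Token constants are taken from OpenAI's published vision pricing
    (gpt-4o values scaled by the mini multiplier); they may change.
    """
    # Step 1: scale the image down to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale down again so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count how many 512 px tiles cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return base + tiles * per_tile

print(image_tokens_4o_mini(2560, 1669))  # 6 tiles -> 36835 tokens
```

For the 2560 × 1669 image above this yields 6 tiles and 36,835 tokens, which together with the short text prompt lands right at the reported 36,848 input tokens.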

Thanks for the link and clarification!

That definitely explains the high token usage. It’s not very intuitive, though — especially since the official documentation uses gpt-4o-mini in its example. If that model inflates token counts by design, it would help to either note that clearly in the docs or use a different model for the demonstration.

Thanks again for pointing this out!


Is this the new multimodal image generator or DALL·E 3? 36,000 tokens doesn’t seem like too much, especially if the ×2 thing is true. What is the output token amount? Even if the input is high, it might still be cheaper than 4o.