Consuming more tokens than expected for image - Vision - gpt-4o

I need some clarification regarding the Vision API. According to the documentation, an image with “low detail” should use only 85 tokens. However, when I run the command below, I’m seeing approximately 305 prompt_tokens in the response.
Is this behavior expected?

curl --location 'https://api.openai.com/v1/chat/completions' \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe image"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
              "detail": "low"
            }
          }
        ]
      }
    ]
  }'
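
A quick back-of-the-envelope check against the documented pricing (a sketch; the 16-token text-only baseline comes from a measurement later in this thread):

    # Expected prompt tokens under the documented Vision pricing
    text_only_baseline = 16      # measured: the same request with no image_url block
    low_detail_image = 85        # documented flat cost for detail:"low"
    expected = text_only_baseline + low_detail_image   # 101
    observed = 305
    print(observed - expected)   # ~204 tokens unaccounted for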

1 Like

I can confirm excessive token usage with detail:low and URL on Chat Completions.

CompletionUsage(completion_tokens=14, prompt_tokens=312, total_tokens=326, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))

312 prompt tokens vs. 16 with no image_url content block: the image adds 296 tokens.

The same is seen with gpt-4o-2024-05-13 and gpt-4o-2024-11-20.

You can take advantage of gpt-4o-mini to make up for the over-billing: it reports the same 312 tokens, instead of the 2833 (85 × 33.33) normally charged for “low” under that model’s 33.33x price multiplier.


This does not align with any increment of 85 tokens (296 ≈ 3.5 × 85). No pricing change has been indicated.

(base64 was not tested, other resolutions at detail:high not tested)
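
For reference, a sketch of the documented token accounting for gpt-4o images, per my reading of the Vision pricing docs (the no-upscaling guard for small images is an assumption; gpt-4o-mini then multiplies these counts):

    import math

    def expected_vision_tokens(width: int, height: int, detail: str = "high") -> int:
        """Documented accounting: 85 flat for detail:"low"; for "high",
        fit within 2048x2048, scale the shortest side toward 768px,
        then charge 85 base + 170 per 512px tile."""
        if detail == "low":
            return 85
        # Fit within a 2048 x 2048 square
        scale = min(1.0, 2048 / max(width, height))
        width, height = width * scale, height * scale
        # Scale so the shortest side is at most 768px (assumption: no upscaling)
        scale = min(1.0, 768 / min(width, height))
        width, height = width * scale, height * scale
        tiles = math.ceil(width / 512) * math.ceil(height / 512)
        return 85 + 170 * tiles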

2 Likes

Cheers, I’ll pass that along

1 Like

I discovered the root cause.

OpenAI is dumping their garbage into AI models if you don’t have a system message.

Knowledge cutoff: 2023-10

Image capabilities: Enabled

Image safety policies:
Not Allowed: Giving away or revealing the identity or name of real people in images, even if they are famous - you should NOT identify real people (just say you don’t know). Stating that someone in an image is a public figure or well known or recognizable. Saying what someone in an image is known for or what work they’ve done. Classifying human-like images as animals. Making inappropriate statements about people in images. Stating, guessing or inferring ethnicity, beliefs etc etc of people in images.
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED. Identifying animated characters.

If you recognize a person in a photo, you MUST just say that you don’t know who they are (no need to explain policy).

Your image capabilities:
You cannot recognize people. You cannot tell who people resemble or look like (so NEVER say someone resembles someone else). You cannot see facial structures. You ignore names in image descriptions because you can’t tell.

Adhere to this in all languages.

I suppose this does overcome the post-trained denials often received on image tasks, where the AI doesn’t understand its own capabilities.

3 Likes

Image capabilities: Enabled
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED.

Bruh :rofl:

And they STILL be injecting knowledge cutoffs. WHY!?!

So even with detail: low, customers can expect to pay almost as much as if they had sent three additional images.

I don’t know which is more crazy: the fact that it’s okay to scan and identify people based on their IDs and credit cards, but not okay to recognize celebrities, or that people are using vision models for OCR tasks so much that they managed to get OpenAI to put it into the injected system prompt.

If you include a system message of your own, like “You have image vision skill”, then you aren’t billed for the undocumented text injection. It is still happening, though.

gpt-4o, Images: 0, Size: N/A, Tokens Estimated: 45, usage: 45, Rate usage: 71
gpt-4o, Images: 1, Size: 400x400, Tokens Estimated: 130, usage: 130, Rate usage: 836
gpt-4o, Images: 1, Size: 1200x400, Tokens Estimated: 130, usage: 130, Rate usage: 836
gpt-4o, Images: 2, Size: 400x400, Tokens Estimated: 215, usage: 215, Rate usage: 1600
gpt-4o, Images: 2, Size: 1200x400, Tokens Estimated: 215, usage: 215, Rate usage: 1600
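
For what it’s worth, those figures line up with a 45-token baseline plus 85 tokens per low-detail image (45 + 85 = 130; 45 + 2 × 85 = 215), so the injected text is no longer being billed. Below is a minimal sketch of such a request using the Python SDK; the system text and image URL are illustrative:

    from openai import OpenAI

    client = OpenAI()

    # Any user-supplied system message appears to suppress billing for the
    # injected text (per the measurements above); the wording is illustrative.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You have image vision skill."},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe image"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://example.com/image.jpg",  # placeholder
                            "detail": "low",
                        },
                    },
                ],
            },
        ],
    )
    print(response.usage.prompt_tokens)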


Your system message is now demoted and contained:

Knowledge cutoff: 2023-10
Image capabilities: Enabled

Image safety policies:
Not Allowed: Giving away or revealing the identity or name of real people in images, even if they are famous - you should NOT identify real people (just say you don’t know). Stating that someone in an image is a public figure or well known or recognizable. Saying what someone in a photo is known for or what work they’ve done. Classifying human-like images as animals. Making inappropriate statements about people in images. Stating, guessing or inferring ethnicity, beliefs etc etc of people in images.
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED. Identifying animated characters.

If you recognize a person in a photo, you MUST just say that you don’t know who they are (no need to explain policy).

Your image capabilities:
You cannot recognize people. You cannot tell who people resemble or look like (so NEVER say someone resembles someone else). You cannot see facial structures. You ignore names in image descriptions because you can’t tell.

Adhere to this in all languages.

Here are some additional instructions, but remember to always follow the above:

{system_message}

This completely breaks “You are xxx” system-message context patterns. And what about your fine-tuning?

4 Likes

Thanks for the responses, everyone @anon10827405 @_j @Foxalabs. Since this is a reported bug, is there a way I can track the issue to see its status? Is there a way to avoid the injected system message, and therefore the cost, or a way to consume fewer tokens (probably not, but just checking :slight_smile:)? An example curl command or snippet of code would be useful if anyone has an idea.

1 Like

I’m encountering the same issue. For me, even adding a system/developer message doesn’t fix it. Given this behavior, the “calculating costs” section of the Vision documentation is simply wrong. I would also appreciate some way to track progress on this.

Hi. Wow, welcome back four years later!

I just made several requests, and it seems the billing has been “patched up” and this topic’s concern is fixed. Omitting every message other than the user message no longer amplifies the cost.

What model are you sending to?

OpenAI also “fixed” the temporary cheap-billing bug in gpt-4o-mini that I reported earlier. It is back to image input tokens being multiplied by 33.3x. Don’t use this model if you wish to save money on image tasks.

An image along with 3 tokens of user text only, sent to gpt-4o (95 prompt tokens is consistent with the documented 85 image tokens plus ~10 tokens of text and message framing):
completion_tokens=1, prompt_tokens=95, total_tokens=96, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)

Wow, quick reply!

This is my request:

    import base64

    from openai import OpenAI

    client = OpenAI()

    # Read and base64-encode the image (the file path here is illustrative;
    # the original post did not show how the image was loaded).
    with open("image.jpg", "rb") as f:
        base64_image = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Do you see a bathtub? Answer only 'yes' or 'no'."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "low",
                        },
                    },
                ],
            }
        ],
        max_tokens=1,
    )

The image is 512x512 px.

I get back:

ChatCompletion(id='chatcmpl-B9HWRM1rb22loOIJw4Ltkq6lfkByN', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Yes', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1741552007, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier='default', system_fingerprint='fp_7fcd609668', usage=CompletionUsage(completion_tokens=1, prompt_tokens=2855, total_tokens=2856, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=2688)))

Those 2855 prompt tokens seem excessive, and I assumed they were due to the added system prompt. But maybe it’s something else I’m doing wrong here?
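
A quick sanity check of that number, assuming the ~33.33x gpt-4o-mini image-token multiplier discussed above, rather than the injected prompt, as the cause:

    low_detail_base = 85                                 # documented flat cost for detail:"low"
    mini_image_tokens = round(low_detail_base * 33.33)   # 2833 under gpt-4o-mini's multiplier
    text_and_overhead = 2855 - mini_image_tokens         # 22 tokens of text + message framing
    print(mini_image_tokens, text_and_overhead)          # 2833 22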

It is back to the input token cost of images being multiplied by 33.3x.

Is that intentional? It’s not what’s stated in the documentation, so I’m a bit confused.

With gpt-4o, the input token count is as expected, and as an end result the two models cost basically the same.

I don’t see how the cheap billing was a bug if it matched the cost described on the website. But apparently this is just what OpenAI does when calculating vision tokens for 4o-mini. It would be good to be transparent about it.

The pricing page has recently been obfuscated. It used to have a separate image-pricing calculator under each of gpt-4o and gpt-4o-mini. Now you have to go to the very bottom and expand “how is pricing calculated for images”.

You can put the same image dimensions into each and see that gpt-4o-mini costs you twice as much as gpt-4o (or the same image price as the more expensive model it was not supposed to compete against, gpt-4o-2024-05-13).
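
Working that out under the published input-token prices (assumed here: $0.15/1M for gpt-4o-mini, $2.50/1M for gpt-4o, $5.00/1M for gpt-4o-2024-05-13):

    # Cost of one detail:"low" image under each model's token accounting;
    # prices are the published per-million input-token rates (assumed, see above).
    mini = 2833 * 0.15 / 1_000_000        # ≈ $0.000425, with the inflated token count
    gpt4o = 85 * 2.50 / 1_000_000         # ≈ $0.000213
    gpt4o_0513 = 85 * 5.00 / 1_000_000    # ≈ $0.000425
    # gpt-4o-mini comes out ~2x gpt-4o, and identical to gpt-4o-2024-05-13.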

So the token count inflation is “working as intended”.

You’re right, it would be nice to have a choice like “this pic will cost you 100 tokens, Y or N” before you hit send. I think Rogers has infected the system.