Consuming more tokens than expected for image - Vision - gpt-4o

I need some clarification regarding the Vision API. According to the documentation, an image with “low detail” should use only 85 tokens. However, when I run the command below, I’m seeing approximately 305 prompt_tokens in the response.
Is this behavior expected?

curl --location ‘https://api.openai.com/v1/chat/completions
–header ‘Authorization: ’
–header ‘Content-Type: application/json’
–data ‘{
“model”: “gpt-4o”,
“messages”: [
{
“role”: “user”,
“content”: [
{
“type”: “text”,
“text”: “Describe image”
},
{
“type”: “image_url”,
“image_url”: {
“url”: “https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg”,
“detail”: “low”
}
}
]
}
]
}’

I can confirm excessive token usage with detail:low and URL on Chat Completions.

CompletionUsage(completion_tokens=14, prompt_tokens=312, total_tokens=326, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))

312 vs 16 with no image_url content block = 296.

The same is seen with gpt-4o-2024-05-13 and gpt-4o-2024-11-20.

Take advantage of gpt-4o-mini, to make up for the over-billing – reporting the same 312 tokens instead of 2833 for “low” with the 33.33x price multiplier.


This does not align with any increment of 85 tokens. No pricing change has been indicated:

(base64 was not tested, other resolutions at detail:high not tested)

1 Like

Cheers, I’ll pass that along

1 Like

I discovered the root cause.

OpenAI is dumping their garbage into AI models if you don’t have a system message.

Knowledge cutoff: 2023-10

Image capabilities: Enabled

Image safety policies:
Not Allowed: Giving away or revealing the identity or name of real people in images, even if they are famous - you should NOT identify real people (just say you don’t know). Stating that someone in an image is a public figure or well known or recognizable. Saying what someone in an image is known for or what work they’ve done. Classifying human-like images as animals. Making inappropriate statements about people in images. Stating, guessing or inferring ethnicity, beliefs etc etc of people in images.
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED. Identifying animated characters.

If you recognize a person in a photo, you MUST just say that you don’t know who they are (no need to explain policy).

Your image capabilities:
You cannot recognize people. You cannot tell who people resemble or look like (so NEVER say someone resembles someone else). You cannot see facial structures. You ignore names in image descriptions because you can’t tell.

Adhere to this in all languages.

I suppose this does overcome post-trained denials often being received on image tasks where the AI doesn’t understand its capabilities.

3 Likes

Image capabilities: Enabled
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED.

Bruh :rofl:

And they STILL be injecting knowledge cutoffs. WHY!?!

So even with detail: low customers can expect to pay almost the same price as 3 more images.

I don’t know what’s more crazy. The fact that it’s okay to scan and identify people based on their IDs and Credit Cards, but not okay to recognize celebrities. Or that people are using vision models for OCR tasks so much that they managed to get OpenAI to inject it into the system prompt.

If you have a system message, like “You have image vision skill”, then you aren’t billed for undocumented text injection. It is still happening, though.

gpt-4o, Images: 0, Size: N/A, Tokens Estimated: 45, usage: 45, Rate usage: 71
gpt-4o, Images: 1, Size: 400x400, Tokens Estimated: 130, usage: 130, Rate usage: 836
gpt-4o, Images: 1, Size: 1200x400, Tokens Estimated: 130, usage: 130, Rate usage: 836
gpt-4o, Images: 2, Size: 400x400, Tokens Estimated: 215, usage: 215, Rate usage: 1600
gpt-4o, Images: 2, Size: 1200x400, Tokens Estimated: 215, usage: 215, Rate usage: 1600


Your system message is now demoted and contained:

Knowledge cutoff: 2023-10

Knowledge cutoff: 2023-10
Image capabilities: Enabled

Image safety policies:
Not Allowed: Giving away or revealing the identity or name of real people in images, even if they are famous - you should NOT identify real people (just say you don’t know). Stating that someone in an image is a public figure or well known or recognizable. Saying what someone in a photo is known for or what work they’ve done. Classifying human-like images as animals. Making inappropriate statements about people in images. Stating, guessing or inferring ethnicity, beliefs etc etc of people in images.
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED. Identifying animated characters.

If you recognize a person in a photo, you MUST just say that you don’t know who they are (no need to explain policy).

Your image capabilities:
You cannot recognize people. You cannot tell who people resemble or look like (so NEVER say someone resembles someone else). You cannot see facial structures. You ignore names in image descriptions because you can’t tell.

Adhere to this in all languages.

Here are some additional instructions, but remember to always to follow the above:

{system_message}

This completely breaks “You are xxx” system message context patterns. Or your fine-tuning?

3 Likes

Thanks for the response everyone @anon10827405 @_j @Foxalabs . Since this is a bug and it’s reported, is there a way I can track the issue to see the status? Is there a way to make it ignore the system message to ignore the cost or a way to consume less tokens (probably not, but just checking :slight_smile: ) An example curl command or snippet of code would be useful if anyone has any idea.