Welcome to the Community! I suggest providing the URL or the image object directly instead of using base64 encoding. Base64 encoding converts the image into a large string, which significantly increases the number of tokens processed.
The number of tokens is determined by the base tile and the number of high detail tiles used for the image size at the given detail parameter.
You would only get ridiculous token counts if you were not sending the image correctly, as an object inside a content array as specified in the API reference. In that case the AI wouldn’t be able to see it anyway.
Your code looks okay, so my guess is that you might be sending this to a non-vision model. Although, on second thought, you never mentioned any error.
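For reference, here is a minimal Python sketch of the content-array shape discussed above, with the image embedded as a base64 data URL. The helper name `build_vision_request` and its defaults are hypothetical, chosen only for illustration; the body it returns is meant to mirror the Chat Completions message format, and actually sending it (HTTP client, API key) is left out.

```python
import base64
import json


def build_vision_request(image_bytes, prompt, model="gpt-4o",
                         mime="image/jpeg", detail="auto"):
    """Build a request body with one inline image part and one text part.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    # The image travels as a data URL inside an image_url object,
    # not as raw base64 pasted into the text of the prompt.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{b64}"
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": data_url, "detail": detail},
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 100,
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real JPEG file read from disk.
    body = build_vision_request(b"placeholder", "Describe the content of this image")
    print(json.dumps(body, indent=2))
```

If the base64 string ended up inside a plain text part instead, it would be tokenized as text, which is the only case where the encoding itself would inflate the token count.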
I’m also having this issue!
Is there any news regarding this?
Uploading each image to a web server is not feasible for my use case.
I also don’t see why it should be more expensive to analyze an embedded image than an image from a URL.
My test results (given a 300 KB image of a duck, 1600x1067 px):
My query:
{
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "[...]UBQBoBGvdUf//Z",
            "detail": "auto"
          }
        },
        {
          "type": "text",
          "text": "Describe the content of this image"
        }
      ]
    }
  ],
  "max_tokens": 100,
  "temperature": 0.7
}
ChatGptApi tokens usages (Prompt: 36858 + Completion: 66 = Total: 36924)
I get a correct answer, but the token usage is insane.
The inflated input token count on gpt-4o-mini is there to ensure that there is no “cheap vision” AI model. In fact, image input on gpt-4o-mini costs twice as much as on gpt-4o itself.
On gpt-4o-mini, the input token count of an image is multiplied by 33.33.
The 85 tokens of a “detail:low” image sent to gpt-4o become 2833 billed tokens on gpt-4o-mini. Your image is being sent at detail:high as 6 tiles plus the base image, which is 13x the token cost of detail:low, and that produces the billed token count you see.
It is not more expensive to use base64. What you should do is use the better quality model at lower image cost, and the whole API call might be cheaper too.
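The arithmetic above can be checked with a short Python sketch. It assumes the tiling rule described in the vision documentation (fit the image within 2048x2048, shrink so the shortest side is 768 px, then count 512 px tiles) and assumes per-model constants of 85 base + 170 per tile for gpt-4o and 2833 base + 5667 per tile for gpt-4o-mini. Only the 85 and 2833 figures appear in this thread; the per-tile values, the rounding to whole pixels, and the function names are assumptions of this sketch.

```python
import math

# Assumed (base_tokens, tokens_per_tile) per model; the per-tile values
# are not stated in this thread and are an assumption of this sketch.
TOKEN_COSTS = {
    "gpt-4o": (85, 170),
    "gpt-4o-mini": (2833, 5667),
}


def high_detail_tiles(width, height):
    """Count 512 px tiles after the two downscaling steps (never upscales)."""
    # Step 1: fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    return math.ceil(w / 512) * math.ceil(h / 512)


def image_tokens(width, height, model, detail="high"):
    """Estimate billed input tokens for a single image."""
    base, per_tile = TOKEN_COSTS[model]
    if detail == "low":
        return base  # low detail is billed as the base image only
    return base + per_tile * high_detail_tiles(width, height)


if __name__ == "__main__":
    for model in TOKEN_COSTS:
        print(model, image_tokens(1600, 1067, model))
```

Under these assumptions the 1600x1067 duck image comes out at 6 tiles, giving 1105 tokens on gpt-4o (13 times the 85 of detail:low) and 36835 on gpt-4o-mini. That leaves 23 of the reported 36858 prompt tokens for the text and message overhead, which matches the post above whether the image arrives as base64 or as a URL.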