Unexpected token length for vision

Hi folks,
I’m being charged about 40,000 tokens per image. Here is my code; could anyone please help?

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "what is this document about?",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "data:image/jpeg;base64,/9j/4AAQ......",
                },
            }
        ],
    },
]

Hi @marco.lai.c.l :wave:

Welcome to the dev forum.

Can you share what indicates this?


Hello @marco.lai.c.l

Welcome to the Community! I suggest providing the URL or the image object directly instead of using base64 encoding. Base64 encoding converts the image into a large string, which significantly increases the number of tokens processed.

As shown here:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0])

Or, if you have to provide the base64 data, please encode it from an image_path; as you are using image_url, that is why you are being charged this much. Please see the reference below for base64:

import base64
import requests

# OpenAI API Key
api_key = "YOUR_OPENAI_API_KEY"

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

payload = {
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())

I hope that this helps: Vision - OpenAI API

That is absolutely not how it works.

The number of tokens is determined by a base cost plus the number of high-detail tiles needed for the image’s size at the given detail parameter.

You would only see ridiculous token counts if you were not sending the image correctly as an object within a content array, as specified in the API reference. Then the AI wouldn’t be able to see it anyway.
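
As a rough sketch of how that tile accounting works (my own illustration, assuming the documented 512×512 tiling and gpt-4o’s published rates of 85 base tokens plus 170 per tile):

import math

def estimate_image_tokens(width, height, detail="high", base=85, per_tile=170):
    """Rough estimate of vision input tokens; rates here assume gpt-4o."""
    if detail == "low":
        return base  # low detail is a flat base cost regardless of image size
    # High detail: the image is first scaled to fit within 2048 x 2048 ...
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # ... then scaled again so the shorter side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Billed as the base cost plus one tile cost per 512 x 512 tile
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return base + per_tile * tiles

print(estimate_image_tokens(1024, 1024))  # -> 768 x 768 -> 4 tiles -> 765

The 1024×1024 case returning 765 tokens matches the detail:high example in the vision docs.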


I see, thank you for the correction. As I just saw base64 and image_url being used, I thought that was why.

Your code looks okay, so my guess is you might be sending this to a non-vision model. Although, on second thought, you never mentioned any error.

I’m also having this issue!
Is there any news regarding this?

Uploading each image to some web server is not feasible for my use case.
I also don’t see why it should be more expensive to analyze an embedded image than an image fetched from a URL.

My test results (given a 300 KB image of a duck at 1600×1067 px).
My query:

{
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4TEiRXhpZgAASUkqAAgAAAA[...]UBQBoBGvdUf//Z",
            "detail": "auto"
          }
        },
        {
          "type": "text",
          "text": "Describe the content of this image"
        }
      ]
    }
  ],
  "max_tokens": 100,
  "temperature": 0.7
}

Chat Completions API token usage: Prompt: 36858 + Completion: 66 = Total: 36924.
I get a correct answer, but the token usage is insane.

The high input token consumption on gpt-4o-mini is deliberate, to ensure that there is no “cheap vision” AI model. In fact, gpt-4o-mini costs twice as much for image input as plain gpt-4o.

The input token count of an image is multiplied by 33.33 on gpt-4o-mini.

The 85 tokens of a detail:low image sent to gpt-4o become 2833 tokens billed by gpt-4o-mini. Your image at detail:high, a base image plus 6 tiles, carries 13x the token cost of detail:low, which becomes the billed count you see.
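
Working that arithmetic through for a 1600×1067 image like the duck above (a sketch, using the same assumed rates):

# 1600 x 1067 at detail:high scales to 1152 x 768 -> 3 x 2 = 6 tiles
gpt4o_tokens = 85 + 6 * 170                 # = 1105 billed on gpt-4o
mini_tokens = round(gpt4o_tokens * 33.33)   # ~= 36830 billed on gpt-4o-mini

The few dozen remaining prompt tokens are the system and user text, which lines up with the 36858 prompt tokens reported above.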

It is not more expensive to use base64. What you should do is use the better quality model at lower image cost, and the whole API call might be cheaper too.
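
For example, here is a sketch of the same payload, keeping the base64 data URL but switching the model (and optionally requesting "detail": "low", which drops the image to the flat base cost):

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4TEiRXhpZgAASUkqAAgAAAA[...]UBQBoBGvdUf//Z",
            "detail": "low"
          }
        },
        {
          "type": "text",
          "text": "Describe the content of this image"
        }
      ]
    }
  ],
  "max_tokens": 100,
  "temperature": 0.7
}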