I’m trying to calculate the cost per image processed using Vision with GPT-4o. I’m passing a series of base64-encoded JPEG files as content with detail set to low:
history = []
num_prompt_tokens = 0
num_completion_tokens = 0
num_total_tokens = 0

for filename, file_content in file_contents.items():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{file_content}",
                            "detail": "low",
                        },
                    },
                ],
            }
        ],
        max_tokens=256,
    )
    num_prompt_tokens += response.usage.prompt_tokens
    num_completion_tokens += response.usage.completion_tokens
    num_total_tokens += response.usage.total_tokens
    history.append(response.choices[0].message.content)
According to the pricing information (https://platform.openai.com/docs/guides/vision), every low-detail image costs a flat 85 tokens. However, response.usage.prompt_tokens sums to 12,077 tokens across my 13 requests. The text prompt is only 156 input tokens, so the text alone should be at most 13 × 156 = 2,028, and with the images I'd expect at most (13 × 85) + (13 × 156) = 3,133 total input tokens. Where are the extra prompt tokens coming from? Am I safe to use 3,133 as the total input token count, or is the reported 12,077 the correct figure to base costs on?
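As a sanity check, here is a minimal sketch of the arithmetic behind the expected figure, assuming 13 requests, a 156-token text prompt, and the 85-tokens-per-low-detail-image figure from the Vision docs (the hard-coded 12,077 is simply the value reported by the API in my run):

```python
NUM_REQUESTS = 13              # one image per request in the loop above
TEXT_PROMPT_TOKENS = 156       # measured size of the text prompt
LOW_DETAIL_IMAGE_TOKENS = 85   # flat per-image cost per the Vision pricing docs

# What the documented pricing predicts for total input tokens.
expected_prompt_tokens = NUM_REQUESTS * (TEXT_PROMPT_TOKENS + LOW_DETAIL_IMAGE_TOKENS)
print(expected_prompt_tokens)  # 3133

# What response.usage.prompt_tokens actually summed to.
actual_prompt_tokens = 12077
print(actual_prompt_tokens - expected_prompt_tokens)  # 8944 unexplained tokens
```

The gap of nearly 9,000 tokens is what I can't account for from the published pricing.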
