Hi!
Note: I’ve read most of the threads about gpt-image-1
pricing, but I’m still stuck reconciling my usage with the exported billing report.
I’m using the Responses API to generate an image and classify the result. Here’s the Python call:
```python
from textwrap import dedent

# self.client is an openai.AsyncOpenAI() instance;
# ClassificationOutput is a pydantic BaseModel used for structured output.
response = await self.client.responses.parse(
    model="gpt-4.1",
    input=[
        {
            "role": "system",
            "content": dedent(self.system_prompt),
        },
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": user_prompt},
                {"type": "input_image", "image_url": image_url},
            ],
        },
    ],
    tools=[
        {
            "type": "image_generation",
            "background": "transparent",
            "input_fidelity": "low",
            "quality": "medium",
            "output_format": "png",
            "moderation": "low",
            "size": "1024x1024",
        }
    ],
    text_format=ClassificationOutput,
)
```
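For reference, this is how I'm pulling per-request usage on my side to compare against the export (a sketch; I'm assuming the usage object and the image_generation_call output item behave as the SDK documents them, and response is the object returned by the call above):

```python
# Aggregate usage reported back on the response object.
usage = response.usage
print(f"input_tokens:  {usage.input_tokens}")
print(f"output_tokens: {usage.output_tokens}")
print(f"total_tokens:  {usage.total_tokens}")

# The generated image comes back as an output item of type
# "image_generation_call", with the PNG as a base64 string in .result.
for item in response.output:
    if item.type == "image_generation_call":
        print(f"image call {item.id}: {len(item.result or '')} base64 chars")
```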
My local counts (see the sketch after this list for how I computed them):
- system_prompt+user_prompt: 891 tokens (by tiktoken, o200k_base)
- image 1024x768: 129*2+65=323 input tokens to gpt-image-1 (low input fidelity, two tiles)
- pydantic json ClassificationOutput: ~300 tokens (by tiktoken, o200k_base)
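These counts come from a short script like the following (a sketch: system_prompt, user_prompt, and ClassificationOutput are placeholders for my real values, and the image arithmetic just encodes my reading of the image pricing table, i.e. two tiles at 129 tokens plus a 65-token base):

```python
import json

import tiktoken
from pydantic import BaseModel

enc = tiktoken.get_encoding("o200k_base")

# Placeholders; in my code these are the real prompts and output model.
system_prompt = "..."
user_prompt = "..."

class ClassificationOutput(BaseModel):
    label: str
    confidence: float

# Text prompt tokens (891 for my real prompts).
prompt_tokens = len(enc.encode(system_prompt)) + len(enc.encode(user_prompt))

# Structured-output schema tokens (~300 for my real model).
schema_tokens = len(enc.encode(json.dumps(ClassificationOutput.model_json_schema())))

# Input-image tokens for 1024x768 at low input fidelity:
# two tiles * 129 tokens + 65 base tokens.
image_tokens = 129 * 2 + 65  # = 323

print(prompt_tokens, schema_tokens, image_tokens)
```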
Billing export (the corresponding entries):

| num_model_requests | model | input_tokens | output_tokens | batch | service_tier |
|---|---|---|---|---|---|
| 1.0 | gpt-image-1-2025-04-23 | 470.0 | 1056.0 | FALSE | default |
| 1.0 | gpt-4.1-2025-04-14 | 4490.0 | 486.0 | FALSE | default |
Questions:
- Am I correct that this single call triggers both an image generation (gpt-image-1) and then a separate classification pass on gpt-4.1?
- How should I derive the input token counts for each model? The arithmetic behind my expectations is sketched at the end of this post.
  - For image generation, I expected total_tokens = input_image_tokens + input_text_tokens, but my text tokens alone are higher than 470.
  - For the gpt-4.1 classification, even if the generated image is being passed back at high quality, I would expect much less than 4490 input tokens.
- What am I misunderstanding or doing wrong when estimating these numbers? Any guidance on how the Responses API splits and attributes token usage between gpt-4.1 and gpt-image-1, and how to reproduce the billing counts, would be much appreciated.
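For concreteness, here is the arithmetic behind my expectations versus the export, using only the numbers above (the per-model split is exactly the part I'm unsure about):

```python
# Expected vs. billed input tokens, using my local counts from above.
text_tokens = 891          # system + user prompts (tiktoken, o200k_base)
image_input_tokens = 323   # 1024x768 input image at low fidelity
schema_tokens = 300        # ClassificationOutput JSON schema (approximate)

# gpt-image-1: I expected image + text tokens, but the export says 470.
expected_image_gen_input = image_input_tokens + text_tokens  # 1214
billed_image_gen_input = 470.0

# gpt-4.1: prompts + schema, before even counting the generated image
# coming back; the export says 4490.
expected_classification_input = text_tokens + schema_tokens  # 1191
billed_classification_input = 4490.0

print(expected_image_gen_input, billed_image_gen_input)            # 1214 470.0
print(expected_classification_input, billed_classification_input)  # 1191 4490.0
```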