I’m testing out the Responses API for the first time and I can’t find any information on token input/output usage when using the image generation tool.
Model: gpt-4.1-mini-2025-04-14
Input: {role: 'user', content: 'Create a simple icon for GPT-4.1'}
revised_prompt: 'A simple and modern icon representing GPT-4.1, featuring the text "GPT-4.1" in a sleek, futuristic font. The icon should have a clean design with a blue and white color scheme, incorporating subtle tech elements like circuit lines or digital nodes around the text. The background should be plain or gradient for a professional look.'
output_text: 'Here is a simple and modern icon for GPT-4.1 featuring a sleek design with circuit lines and a blue gradient background. Let me know if you want any adjustments!'
Image Output: {
quality: 'medium',
size: '1024x1024'
}
usage: {
input_tokens: 2294,
input_tokens_details: { cached_tokens: 0 },
output_tokens: 119,
output_tokens_details: { reasoning_tokens: 0 },
total_tokens: 2413
}
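For reference, this is roughly how I’m making the call and reading usage (a minimal Node SDK sketch; the tool options are the ones I set, everything else is left at defaults):

```ts
import OpenAI from "openai";

const client = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

async function main() {
  const response = await client.responses.create({
    model: "gpt-4.1-mini-2025-04-14",
    input: "Create a simple icon for GPT-4.1",
    tools: [{ type: "image_generation", quality: "medium", size: "1024x1024" }],
  });

  // All I get back is the combined usage object above; there is no
  // per-modality breakdown for the image generation tool call.
  console.log(response.usage);
}

main();
```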
This doesn’t line up with the docs, which say:
So the final cost is the sum of:
- input text tokens
- input image tokens if using the edits endpoint
- image output tokens
But I assume that’s specifically for the Image API.
It’s important for us to be able to track usage per user when calling the API, so I was hoping for more detailed usage stats for image inputs and outputs. It seems the Responses API doesn’t break out multimodal usage in any detail yet?
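To make that concrete, this is the kind of per-user accounting I want to do. `userId` and the in-memory map are just placeholders for illustration; in production this would be a database row:

```ts
type UsageTotals = { input_tokens: number; output_tokens: number; total_tokens: number };

// Hypothetical per-user accumulator, keyed by our own user IDs.
const usageByUser = new Map<string, UsageTotals>();

function recordUsage(userId: string, usage: UsageTotals) {
  const prev = usageByUser.get(userId) ?? { input_tokens: 0, output_tokens: 0, total_tokens: 0 };
  usageByUser.set(userId, {
    input_tokens: prev.input_tokens + usage.input_tokens,
    output_tokens: prev.output_tokens + usage.output_tokens,
    total_tokens: prev.total_tokens + usage.total_tokens,
  });
}
```

This works fine for text, but with the image tool in play the totals above undercount the actual image cost.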
My best guess is that the reported usage covers the original text and tool input tokens, plus the generated image and text being fed back into the text model; the revised prompt and the text output together roughly account for the 119 output tokens.
Is the best bet in the meantime to estimate the gpt-image-1 usage myself, and work out how many tokens it used for processing the image?
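If so, a lookup like the sketch below is what I have in mind. The per-image output token counts are my reading of the published gpt-image-1 figures, not something the API returns, so treat them as assumptions to verify against the current pricing docs:

```ts
// Assumption: per-image output token counts for gpt-image-1, copied from the
// image generation pricing docs at the time of writing. Verify before billing.
const IMAGE_OUTPUT_TOKENS: Record<string, Record<string, number>> = {
  low:    { "1024x1024": 272,  "1024x1536": 408,  "1536x1024": 400 },
  medium: { "1024x1024": 1056, "1024x1536": 1584, "1536x1024": 1568 },
  high:   { "1024x1024": 4160, "1024x1536": 6240, "1536x1024": 6208 },
};

function estimateImageOutputTokens(quality: string, size: string): number | undefined {
  return IMAGE_OUTPUT_TOKENS[quality]?.[size];
}

// For the response above: medium at 1024x1024 -> ~1056 image output tokens,
// on top of the 119 text output tokens the usage object actually reports.
console.log(estimateImageOutputTokens("medium", "1024x1024"));
```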