Responses API Image Generation Token Usage

I’m testing out the Responses API for the first time and I can’t find any information on token input/output usage when using the image generation tool.

Model: gpt-4.1-mini-2025-04-14
Input: {role: 'user', content: 'Create a simple icon for GPT-4.1'}

revised_prompt: 'A simple and modern icon representing GPT-4.1, featuring the text "GPT-4.1" in a sleek, futuristic font. The icon should have a clean design with a blue and white color scheme, incorporating subtle tech elements like circuit lines or digital nodes around the text. The background should be plain or gradient for a professional look.'

output_text: 'Here is a simple and modern icon for GPT-4.1 featuring a sleek design with circuit lines and a blue gradient background. Let me know if you want any adjustments!'

Image Output: {
  quality: 'medium',
  size: '1024x1024'
}

usage: {
    input_tokens: 2294,
    input_tokens_details: { cached_tokens: 0 },
    output_tokens: 119,
    output_tokens_details: { reasoning_tokens: 0 },
    total_tokens: 2413
}
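
For context, the call that produced this is shaped roughly like the sketch below (official Python SDK; the parameter names are from the Responses API docs, and the printed numbers are the same usage fields shown above):

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input="Create a simple icon for GPT-4.1",
    tools=[{"type": "image_generation"}],
)

# usage only breaks down text-model tokens; nothing here identifies image tokens
print(response.usage.input_tokens)   # 2294
print(response.usage.output_tokens)  # 119
print(response.usage.total_tokens)   # 2413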

This doesn’t line up with the docs:

So the final cost is the sum of:

  • input text tokens
  • input image tokens if using the edits endpoint
  • image output tokens

But I assume that’s specifically for the Image API.
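
Taken literally, that doc formula would be something like the sketch below; the per-million rates are my assumption from the gpt-image-1 pricing page, not something the Responses API reports:

# Assumed gpt-image-1 rates in USD per 1M tokens (text in / image in / image out);
# these come from the public pricing page, not from anything the API returns.
TEXT_IN, IMAGE_IN, IMAGE_OUT = 5.00, 10.00, 40.00

def image_api_cost(text_in_tokens: int, image_in_tokens: int, image_out_tokens: int) -> float:
    # sum of the three documented cost components
    return (text_in_tokens * TEXT_IN
            + image_in_tokens * IMAGE_IN
            + image_out_tokens * IMAGE_OUT) / 1_000_000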


It’s important for us to be able to track usage per user when calling the API. I was hoping for more detailed usage stats when it comes to image inputs/outputs. It seems the API doesn’t give detailed info on any multi-modal usage yet?

My best guess is that the usage shows the original text and tool-specification input tokens, plus the generated image and text being fed back into the text model. The revised prompt plus the text output roughly account for the reported output tokens.

Is the best bet in the meantime to estimate the gpt-image-1 usage, and determine how many input tokens it used for processing the image?

I have made a topic about this, highlighting the non-disclosure of how the technology actually works: it is not an independent tool that just receives a prompt; rather, it receives context from the chat, including past images, which also continue to cost as vision input.

The thing is: the image tokens of the gpt-image-1 model are not what you are seeing in that usage; the image tool is billed at a different rate.

I hope to classify the usage and thresholds a bit more, but that work could be made obsolete by proper documentation, rather than “try it out…and pay”.


Hot tip:

Save some vision expense when passing in images: resize so the shorter dimension is 512 pixels, or so the longer dimension is at most 1024 pixels. The first is the maximum internal resize fed to the image creation model; the second saves you the expense of going over two 512 px “tiles” on recurring vision input.
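
Something like this Pillow helper does the trick (my own sketch, applying both limits at once; the tiling math itself is on OpenAI's side):

from PIL import Image

def shrink_for_vision(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path)
    w, h = img.size
    # cap the shorter side at 512 px and the longer side at 1024 px, preserving aspect ratio
    scale = min(512 / min(w, h), 1024 / max(w, h), 1.0)
    if scale < 1.0:
        img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    img.save(dst_path)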


Estimating basics

A normal “hello” call with the lowest max_output_tokens allowed, 16:

['Hello! How can I assist you today?']
{
  "input_tokens": 10,
  "input_tokens_details": {
    "cached_tokens": 0
  },
  "output_tokens": 10,
  "output_tokens_details": {
    "reasoning_tokens": 0
  },
  "total_tokens": 20
}

With the addition of tools=[{"type": "image_generation"}], here is the increase per API iteration, just from the tool specification language added to the input:

{
  "input_tokens": 265,
  "input_tokens_details": {
    "cached_tokens": 0
  },
  "output_tokens": 11,
  "output_tokens_details": {
    "reasoning_tokens": 0
  },
  "total_tokens": 276
}
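
For reference, both measurements came from calls shaped like this (a sketch using requests against the raw endpoint; exact token counts will drift a little between model snapshots):

import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

def usage_for(body: dict) -> dict:
    # direct RESTful call so the usage comes back unfiltered
    r = requests.post("https://api.openai.com/v1/responses", headers=HEADERS, json=body)
    r.raise_for_status()
    return r.json()["usage"]

base = {"model": "gpt-4o", "input": "hello", "max_output_tokens": 16}
print(usage_for(base))                                               # ~10 in / 10 out
print(usage_for({**base, "tools": [{"type": "image_generation"}]}))  # ~265 in / 11 out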

With tool invocation, you pay for the input (at least) twice: the second time through, the first output is billed as input, along with the new tool response and whatever the AI writes again.

The absolute minimum-cost image - quality:low, size:1024x1024, 14-token prompt only.

  • The cost below is NOT for the image, just for gpt-4o.

“Create the OpenAI logo. Just say ‘Done’ when complete.”

['Done.']
{
  "input_tokens": 674,
  "input_tokens_details": {
    "cached_tokens": 0
  },
  "output_tokens": 33,
  "output_tokens_details": {
    "reasoning_tokens": 0
  },
  "total_tokens": 707
}
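
That run came from a request body like this (a sketch; pinning quality and size on the tool entry, as I read the tool options in the docs, is also how you avoid the “auto” defaults complained about below):

import os
import requests

body = {
    "model": "gpt-4o",
    "input": "Create the OpenAI logo. Just say 'Done' when complete.",
    "tools": [{"type": "image_generation", "quality": "low", "size": "1024x1024"}],
}
r = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=body,
)
print(r.json()["usage"])  # only the gpt-4o tokens shown above; no image tokens reported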

Useless Usage

The API response object does not report image tokens, and I’m making direct RESTful calls so I can see the usage unfiltered.

The tool defaults to auto quality (if you can imagine) and auto size, giving unpredictable costs.

Then one has to dig many clicks into the Usage page (with the legacy usage view now gone), independently looking at seven different categories such as “cached input tokens” and “model requests” under chat completions, only to see the input tokens anywhere.

gpt-image-1-2025-04-23
input: 23 tokens (appears to reflect the prompt passed along from the chat model’s 33 output tokens, minus the “Done.”)
output: (should be 272 tokens = $0.01088, not yet shown in usage)

The result: the image output cost cannot be found anywhere other than in the dollar “total spend”, and nowhere in tokens.
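
Until that changes, the workaround is estimating the image output tokens yourself. A sketch; the token counts and rate below are my reading of the gpt-image-1 pricing tables (1024x1024 only), so verify them before relying on them:

# Assumed output image tokens per 1024x1024 image by quality, and the assumed
# gpt-image-1 output rate of $40 per 1M tokens; portrait/landscape sizes have their own rows.
OUTPUT_IMAGE_TOKENS_1024 = {"low": 272, "medium": 1056, "high": 4160}
OUTPUT_PRICE_PER_M = 40.00

def image_output_cost(quality: str) -> float:
    return OUTPUT_IMAGE_TOKENS_1024[quality] * OUTPUT_PRICE_PER_M / 1_000_000

print(image_output_cost("low"))  # 0.01088, matching the figure above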