Responses API: How to identify the exact underlying image generation model for precise internal billing?

I’m currently working on a SaaS application where I need to implement a precise internal billing system, deducting credits from my users based on their exact token consumption.

To get accurate usage metrics (especially to track prompt_cache_hit_tokens for input and exact completion_tokens for output), I migrated from the standard Image API (images/generations) to the new Responses API.

The integration works flawlessly, but I’ve hit a roadblock regarding cost calculation.

The documentation states: “The Responses API image generation tool uses its own GPT Image model selection.” While the Response payload correctly provides the token counts in the usage object, it doesn’t seem to expose the name of the underlying image model that was actually used.

Since output token pricing varies drastically between models, multiplying the completion_tokens by the right price is impossible without knowing the exact model.
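To make the dependency concrete, here is a minimal sketch of the arithmetic. The model names come from this thread, but the rates are made-up placeholders (not OpenAI prices), and `output_cost` is a hypothetical helper; the token count would come from the usage object of the response.

```python
from decimal import Decimal

# PLACEHOLDER rates in USD per 1M output tokens. These are NOT real prices;
# substitute the figures from the current pricing page.
PLACEHOLDER_OUTPUT_RATES = {
    "gpt-image-1": Decimal("10.00"),
    "gpt-image-1-mini": Decimal("2.00"),
}


def output_cost(model: str, output_tokens: int) -> Decimal:
    """Cost of the output tokens, which is only computable for a known model."""
    try:
        rate = PLACEHOLDER_OUTPUT_RATES[model]
    except KeyError:
        raise ValueError(f"no rate configured for model {model!r}") from None
    return rate * output_tokens / Decimal(1_000_000)


if __name__ == "__main__":
    # Same token count, different model, different charge.
    for name in PLACEHOLDER_OUTPUT_RATES:
        print(name, output_cost(name, 4000))
```

The same token count produces a different charge per model, which is why the model name has to be known before any credits are deducted.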

My questions are:

  1. Is there a strict, documented mapping between the driver LLM and the image model? For instance, is it guaranteed that gpt-5-mini will always trigger gpt-image-1-mini, and gpt-5.5 will always trigger gpt-image-2?

  2. Is there a way to extract the exact image model name directly from the Response object payload? (I checked the SDK and couldn’t find a parameter for it).

  3. If this is currently a “black box”, are there any plans to expose the underlying image model in the usage block or metadata in future updates? Precise cost attribution is vital for developers building user-facing applications.

Thanks in advance for any insights or official clarifications!

The API reference for tools shows that automatic model selection is optional: you can name the specific image model that the tool should use.
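One way to act on this is to pin the model in the tool configuration and store the same string with the request for billing. The sketch below only builds the request body and makes no API call; `build_image_request` is a hypothetical helper, `gpt-5-mini` is taken from the question, and the pinned image model is just an example value.

```python
import json


def build_image_request(driver_model: str, image_model: str, prompt: str) -> dict:
    """Build a Responses API request body with an explicitly pinned image model."""
    return {
        "model": driver_model,
        "input": prompt,
        "tools": [
            {
                "type": "image_generation",
                # Pinned explicitly so billing does not depend on a default.
                "model": image_model,
            }
        ],
    }


if __name__ == "__main__":
    body = build_image_request("gpt-5-mini", "gpt-image-1-mini", "A red bicycle")
    print(json.dumps(body, indent=2))
```

With the official Python SDK, a body like this should be usable as `client.responses.create(**body)`; the pinned string is then the model name to price against.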

It also states that gpt-image-1 is the default model, but the API reference does not yet even include gpt-image-2.

Costs are another matter entirely. An ImageGenerationCall output item or event provides no usage, model, or billing information, and nothing about the cost of input images that were pulled in automatically from the chat as image generation context. Billing is at the image model’s rates, so it would not appear in “usage details”. By “chatting” with an image model, you are effectively accepting that a response could cost you $1.00 or more whenever the AI invokes this tool, because how billable input context is collected is obscured.
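Given that gap, the most a client can do is read back what is visible. The sketch below works on a response converted to a plain dict (for example with the SDK's `model_dump()`). It assumes that the `tools` array echoed in the response carries the `model` value that was sent and that image outputs appear as items of type `image_generation_call`; `summarize_image_calls` is a hypothetical helper.

```python
from typing import Optional


def summarize_image_calls(response: dict) -> dict:
    """Report the echoed image model (if any) and how many image calls ran."""
    model: Optional[str] = None
    for tool in response.get("tools") or []:
        if tool.get("type") == "image_generation":
            # May be None if the tool was left on its default model.
            model = tool.get("model")
            break
    calls = [
        item
        for item in response.get("output") or []
        if item.get("type") == "image_generation_call"
    ]
    return {"image_model": model, "image_calls": len(calls)}


if __name__ == "__main__":
    sample = {
        "tools": [{"type": "image_generation", "model": "gpt-image-1"}],
        "output": [
            {"type": "image_generation_call", "id": "ig_1", "status": "completed"},
            {"type": "message", "id": "msg_1"},
        ],
    }
    print(summarize_image_calls(sample))
```

This yields a model name and a call count for attribution, not a cost.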

The API reference documentation page is currently a formatting mess. Here are the parameters that can be passed in “tools” when you include the image_generation tool, listed alphabetically. The table is unfortunately wider than this forum can display, and the schema is also reused for the response output echo, which is why some fields may not look like inputs.


Image Generation Tool

The image generation tool creates new images or edits existing images using GPT image models.

Use this tool by including a configuration with:

{
  "type": "image_generation"
}

Additional parameters may be provided to control the model, image size, quality, background, output format, editing behavior, and streaming previews.


Basic Example

{
  "type": "image_generation",
  "model": "gpt-image-1.5",
  "action": "generate",
  "size": "1024x1024",
  "quality": "high",
  "background": "auto",
  "output_format": "png"
}

Parameter Reference

| Parameter | Required | Accepted Values | Default | Description |
|---|---|---|---|---|
| type | Yes | "image_generation" | | Identifies this as the image generation tool. This value must always be "image_generation". |
| action | No | "generate", "edit", "auto" | "auto" | Controls whether the tool should create a new image, edit an existing image, or automatically choose the appropriate behavior. |
| background | No | "transparent", "opaque", "auto" | "auto" | Controls whether the generated image should have a transparent background, an opaque background, or automatic background handling. |
| input_fidelity | No | "high", "low" | "low" | Controls how closely the output should preserve style, identity, and visual details from input images. Especially relevant for facial features and image edits. |
| input_image_mask | No | See Input Image Mask | | Provides a mask image for inpainting or targeted image editing. |
| model | No | "gpt-image-1", "gpt-image-1-mini", "gpt-image-1.5", or another supported model name as a string | "gpt-image-1" | Selects the image generation model. |
| moderation | No | "auto", "low" | "auto" | Controls the moderation strictness applied to generated images. |
| output_compression | No | Number | 100 | Controls output image compression. Mainly relevant for compressed formats such as JPEG or WebP. |
| output_format | No | "png", "webp", "jpeg" | "png" | Sets the file format for the generated image. |
| partial_images | No | Number from 0 to 3 | 0 | Controls how many partial image previews are produced while streaming. Use 0 to disable partial images. |
| quality | No | "low", "medium", "high", "auto" | "auto" | Controls the quality level of the generated image. Higher quality may increase generation time or cost. |
| size | No | "1024x1024", "1024x1536", "1536x1024", "auto" | "auto" | Sets the output image dimensions. Use "auto" to let the system choose. |

Parameters in Detail

type

Identifies the tool configuration as an image generation request.

Required value:

"type": "image_generation"

This parameter is required.


action

Controls whether the tool generates a new image, edits an existing image, or decides automatically.

Accepted values:

| Value | Meaning |
|---|---|
| "generate" | Create a new image from the prompt or instructions. |
| "edit" | Modify an existing input image. |
| "auto" | Let the system choose between generation and editing behavior. |

Default:

"auto"

background

Controls the background style of the generated image.

Accepted values:

| Value | Meaning |
|---|---|
| "transparent" | Generate an image with transparency where supported. |
| "opaque" | Generate an image with a solid, non-transparent background. |
| "auto" | Let the system choose the appropriate background handling. |

Default:

"auto"

input_fidelity

Controls how closely the output should preserve details from supplied input images.

Accepted values:

| Value | Meaning |
|---|---|
| "high" | Stronger preservation of input image details, style, and features. Useful when editing faces, likenesses, or specific visual identities. |
| "low" | Looser preservation of input details. Allows more variation from the input image. |

Default:

"low"

Model support:

  • Supported by gpt-image-1
  • Supported by gpt-image-1.5
  • Not supported by gpt-image-1-mini

input_image_mask

Provides a mask image for inpainting or targeted editing.

The mask can be supplied either as a file ID or as a base64-encoded image string.

Example using a file ID:

{
  "input_image_mask": {
    "file_id": "file_abc123"
  }
}

Example using a base64-encoded image:

{
  "input_image_mask": {
    "image_url": "data:image/png;base64,..."
  }
}

Subfields:

| Field | Required | Accepted Values | Description |
|---|---|---|---|
| file_id | No | String | ID of a previously uploaded mask image file. |
| image_url | No | String | Base64-encoded mask image. |

Notes:

  • Use input_image_mask when only part of an image should be edited.
  • The mask identifies the area to modify during inpainting.
  • At least one of file_id or image_url should be provided when using a mask.
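For the base64 variant, the value is a data URL built from the raw image bytes. A minimal sketch, assuming a PNG mask; `mask_from_png_bytes` is a hypothetical helper and the bytes in the demo are a stand-in, not a valid image.

```python
import base64


def mask_from_png_bytes(png_bytes: bytes) -> dict:
    """Wrap PNG bytes as an input_image_mask entry using a base64 data URL."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {"image_url": f"data:image/png;base64,{encoded}"}


if __name__ == "__main__":
    # Stand-in bytes; in practice read the mask file, e.g. Path("mask.png").read_bytes()
    print(mask_from_png_bytes(b"\x89PNG\r\n\x1a\n"))
```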

model

Selects the image generation model.

Accepted values include:

| Value | Description |
|---|---|
| "gpt-image-1" | Default image generation model. |
| "gpt-image-1-mini" | Smaller image generation model. Some advanced features may not be supported. |
| "gpt-image-1.5" | Newer image generation model with support for advanced image features. |
| Any other supported model name as a string | Allows specifying another compatible image model if available. |

Default:

"gpt-image-1"

moderation

Controls the moderation level used for image generation.

Accepted values:

| Value | Meaning |
|---|---|
| "auto" | Use the default moderation behavior. |
| "low" | Use a lower moderation level where available. |

Default:

"auto"

output_compression

Controls the compression level of the output image.

Accepted value:

Number

Default:

100

Notes:

  • This is most relevant for compressed output formats such as "jpeg" and "webp".
  • Higher values generally mean less compression and higher image quality.
  • Lower values generally mean more compression and smaller file size.

output_format

Sets the output image file format.

Accepted values:

| Value | Meaning |
|---|---|
| "png" | PNG image output. Useful for lossless images and transparency. |
| "webp" | WebP image output. Useful for compressed web images. |
| "jpeg" | JPEG image output. Useful for photographs and compressed images without transparency. |

Default:

"png"

partial_images

Controls how many partial images are generated while streaming.

Accepted value:

0, 1, 2, or 3

Default:

0

Meaning:

| Value | Meaning |
|---|---|
| 0 | Do not generate partial image previews. |
| 1 | Generate one partial image preview. |
| 2 | Generate two partial image previews. |
| 3 | Generate three partial image previews. |

quality

Controls the image quality level.

Accepted values:

| Value | Meaning |
|---|---|
| "low" | Lower quality, generally faster or less expensive. |
| "medium" | Balanced quality. |
| "high" | Higher quality, generally slower or more expensive. |
| "auto" | Let the system choose the appropriate quality level. |

Default:

"auto"

size

Sets the output image dimensions.

Accepted values:

| Value | Orientation |
|---|---|
| "1024x1024" | Square |
| "1024x1536" | Portrait |
| "1536x1024" | Landscape |
| "auto" | Automatically selected |

Default:

"auto"

Compact Example: Generate a Square PNG

{
  "type": "image_generation",
  "action": "generate",
  "model": "gpt-image-1",
  "size": "1024x1024",
  "quality": "auto",
  "output_format": "png"
}

Compact Example: Edit an Image With High Input Fidelity

{
  "type": "image_generation",
  "action": "edit",
  "model": "gpt-image-1.5",
  "input_fidelity": "high",
  "size": "1024x1024",
  "quality": "high",
  "output_format": "png"
}

Compact Example: Use a Mask for Inpainting

{
  "type": "image_generation",
  "action": "edit",
  "model": "gpt-image-1.5",
  "input_image_mask": {
    "file_id": "file_abc123"
  },
  "size": "1024x1024",
  "quality": "high",
  "output_format": "png"
}
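The constraints in this reference can be checked client-side before a request is sent. The sketch below is based only on the values transcribed in this post, which may lag behind the live API, and `validate_image_tool` is a hypothetical helper rather than anything the SDK provides.

```python
# Enumerated values as transcribed in the parameter reference above.
ALLOWED = {
    "action": {"generate", "edit", "auto"},
    "background": {"transparent", "opaque", "auto"},
    "input_fidelity": {"high", "low"},
    "moderation": {"auto", "low"},
    "output_format": {"png", "webp", "jpeg"},
    "quality": {"low", "medium", "high", "auto"},
    "size": {"1024x1024", "1024x1536", "1536x1024", "auto"},
}


def validate_image_tool(tool: dict) -> list[str]:
    """Return a list of problems found in an image_generation tool config."""
    problems = []
    if tool.get("type") != "image_generation":
        problems.append('type must be "image_generation"')
    for key, allowed in ALLOWED.items():
        if key in tool and tool[key] not in allowed:
            problems.append(f"{key}: unexpected value {tool[key]!r}")
    partial = tool.get("partial_images", 0)
    if not isinstance(partial, int) or not 0 <= partial <= 3:
        problems.append("partial_images must be an integer from 0 to 3")
    # Per the model support list, gpt-image-1-mini lacks input_fidelity.
    if tool.get("model") == "gpt-image-1-mini" and "input_fidelity" in tool:
        problems.append("input_fidelity is not supported by gpt-image-1-mini")
    mask = tool.get("input_image_mask")
    if mask is not None and not (mask.get("file_id") or mask.get("image_url")):
        problems.append("input_image_mask needs file_id or image_url")
    return problems


if __name__ == "__main__":
    print(validate_image_tool({"type": "image_generation", "size": "512x512"}))
```

An empty list means the config matches the transcribed reference; model names are deliberately not restricted, since the reference allows any other supported model string.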

Hi _j,

Thank you so much for the incredibly detailed response. It completely clarifies the situation, even if it confirms my worst fears.

The fact that the Responses API silently passes the entire chat context to the image model, and bills for those vision/input tokens without explicitly reporting them anywhere in the usage object, is a massive dealbreaker.

Do you (or anyone else closely following the developer updates) foresee OpenAI updating this? Is there any buzz or roadmap indicating that they will eventually expose the exact, itemized tool invocation costs directly in the API response payload?

Thanks again.

The main thing here: OpenAI gives a “cached” price for the image model, yet never delivers a cached discount on the generate or edits images API.

It is just as likely that they never give a discount when images are generated by the Responses API tool while you “chat with an image buddy”, which was your reason for exploring Responses. With such poor auditing and apparent obfuscation, you would have to set up a separate project and run a deterministic sequence of calls just to discover the costs, let alone report an error.

The logical place to return the usage would be the internal tool call item that is already returned as output, just as you can retrieve the code the AI wrote. When streaming live, you could then monitor that tool call result and say, “that’s enough money spent on images for you”.
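Until usage is attached to the tool call, the closest approximation is counting finished image calls against an estimate you supply yourself. The sketch below assumes stream events are available as dicts and that a finished image call arrives as a `response.output_item.done` event whose `item.type` is `image_generation_call`. The `estimated_cost_per_image` figure is your own guess, not something the API reports, and both names defined here are hypothetical.

```python
from typing import Iterable, Iterator


class ImageBudgetExceeded(RuntimeError):
    """Raised when estimated image spend passes the configured cap."""


def guard_image_spend(
    events: Iterable[dict],
    estimated_cost_per_image: float,
    budget: float,
) -> Iterator[dict]:
    """Yield events unchanged, raising once estimated image spend exceeds budget."""
    spent = 0.0
    for event in events:
        item = event.get("item") or {}
        if (
            event.get("type") == "response.output_item.done"
            and item.get("type") == "image_generation_call"
        ):
            spent += estimated_cost_per_image
            if spent > budget:
                raise ImageBudgetExceeded(
                    f"estimated spend {spent:.2f} exceeds budget {budget:.2f}"
                )
        yield event


if __name__ == "__main__":
    done = {"type": "response.output_item.done", "item": {"type": "image_generation_call"}}
    try:
        for ev in guard_image_spend([done, done], estimated_cost_per_image=0.20, budget=0.25):
            print("passed through:", ev["type"])
    except ImageBudgetExceeded as exc:
        print("stopped:", exc)
```

This only stops consuming the stream on the client side after the image has already been produced, so it limits further spend rather than preventing the charge that tripped it.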

I don’t foresee any change. There is no “bug report” channel that is actually heard here, where you could say, “we must have costs in order to even consider this product, so we can bill for what you give to free ChatGPT users”.

No tool tells you how much it is costing you: whether you were charged for more “code” container time in 20-minute increments, for individual vector store token placement, and now for vector store search fees, internet search fees, and so on. It is hard to believe this opacity is merely a prolonged oversight. They did the same with Assistants when its retrieval costs were astronomical, even degrading the usage page on the day of release.

What you will at least note is that you have some semblance of control over the image output and the model through the parameters you specify. However, that does not let you set logical constraints with reasonable variation (such as a varying aspect ratio) for arbitrary images produced by chatting with the AI. You would need a bunch of user interface controls, or to allow just one image type. Your imagination is limited to their imagination.