The pricing of gpt-image-1 in the documentation understates what is actually billed, and the information is scattered across several pages. On top of that, the way Responses consumes chat context and fills the image model with tokens and multiple images after a tool call is incompletely documented or not documented at all.
Here is the pricing collected in one place, to step through an examination of your ultimate costs of using gpt-image-1, versus dall-e-3, which is simply $0.04 per image.
gpt-image-1 pricing (per 1M tokens)
Token type | Input | Input (cached) | Output |
---|---|---|---|
text tokens | $5.00 | $1.25 | N/A |
image tokens | $10.00 | $2.50 | $40.00 |
That might mean very little on its own. Here is how those rates are applied on the Images API endpoint.
1. Output Image Tokens - $40/1M
Quality \ Resolution | 1024x1024 | 1024x1536 | 1536x1024 |
---|---|---|---|
low | 272 | 408 | 400 |
medium | 1056 | 1584 | 1568 |
high | 4160 | 6240 | 6208 |
Here, then, are the actual prices, rather than the inaccurate price table in the documentation:
Quality \ Resolution | 1024×1024 | 1024×1536 | 1536×1024 |
---|---|---|---|
Low | $0.01088 | $0.01632 | $0.01600 |
Medium | $0.04224 | $0.06336 | $0.06272 |
High | $0.16640 | $0.24960 | $0.24832 |
(potential cost: $0.2496)
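The dollar figures above are simply the token counts multiplied by the $40/1M output rate. A minimal sketch of that conversion, using the token counts from the table (the lookup table and function name are mine):

```python
# Output-image token counts from the table above: quality -> {resolution: tokens}
OUTPUT_TOKENS = {
    "low":    {"1024x1024": 272,  "1024x1536": 408,  "1536x1024": 400},
    "medium": {"1024x1024": 1056, "1024x1536": 1584, "1536x1024": 1568},
    "high":   {"1024x1024": 4160, "1024x1536": 6240, "1536x1024": 6208},
}
OUTPUT_PRICE_PER_TOKEN = 40.00 / 1_000_000  # $40 per 1M output image tokens

def output_image_cost(quality: str, size: str) -> float:
    """Dollar cost of one generated image, e.g. output_image_cost('high', '1024x1536') ~= 0.2496."""
    return OUTPUT_TOKENS[quality][size] * OUTPUT_PRICE_PER_TOKEN
```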
2. Output Costs - Preview (Partial) Image Tokens
These are only offered when using tool-calling in Responses
# Preview images | Tokens | Cost ($) |
---|---|---|
0 | 0 | $0.000 |
1 | 100 | $0.004 |
2 | 200 | $0.008 |
3 | 300 | $0.012 |
(potential cost: $0.2616)
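Each preview image is billed as a flat 100 output-image tokens at the same $40/1M rate, so it folds straight into the output calculation. A small sketch (the function name is mine):

```python
PREVIEW_TOKENS_EACH = 100  # flat output-image tokens billed per partial/preview image

def preview_cost(n_previews: int) -> float:
    """Extra dollar cost for n preview images, e.g. preview_cost(3) -> ~0.012."""
    return n_previews * PREVIEW_TOKENS_EACH * 40.00 / 1_000_000
```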
3. Input Costs - Language Prompting $5.00/1M
First: hitting the cache is unlikely. You would have to send over 1024 tokens that are identical within a short period - essentially the same request twice, with this model delivering little variety to reward that.
So you have to count text with tiktoken, using the o200k_base token encoder. Generally you can go by words: for English, the best case, multiply the word count by about 1.25, or divide the character count by 4-5.
Language is relatively cheap, but 1500 words (about 2176 tokens) would double the cost of a square/low image, and I have made a case for such a template length before.
prompt | Tokens | Cost ($) | Word Est |
---|---|---|---|
average | 200 | $0.001 | 150 |
medium | 500 | $0.0025 | 400 |
long | 1000 | $0.005 | 800 |
(potential cost: $0.2666)
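For exact counts rather than word estimates, here is a minimal sketch using tiktoken with the o200k_base encoder named above (assumes the tiktoken package is installed; the function name is mine):

```python
import tiktoken

TEXT_PRICE_PER_TOKEN = 5.00 / 1_000_000  # $5.00 per 1M uncached text input tokens

def prompt_text_cost(prompt: str) -> float:
    """Count prompt tokens with o200k_base and price them at the text input rate."""
    enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(prompt)) * TEXT_PRICE_PER_TOKEN
```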
4. Input Costs - Image Prompting (Edits) - $10.00/1M
This billing is 5x the rate of gpt-4.1 vision, but the input images are downsized more aggressively.
Input images are downscaled (never upscaled) so that the shortest side is at most 512px, with the longest side then capped at 2048px. The result is broken into 512 x 512 tiles for your cost:
The base cost is 65 image tokens, and each tile costs 129 image tokens, at $10/1M ($2.50/1M cached).
Per input image:
Image size | Input size | Tiles | Tokens | Cost | Note |
---|---|---|---|---|---|
1024x1024 | 512x512 | 1 | 194 | $0.00194 | Any square |
1080x1920 | 512x910 | 2 | 323 | $0.00323 | selfie |
1536x1024 | 768x512 | 2 | 323 | $0.00323 | gen image |
1600x400 | 1600x400 | 4 | 581 | $0.00581 | 4:1 slice |
Almost any user input image will be two tiles, being non-square but with the longer side no more than twice the shorter.
The edits endpoint allows up to 10 images, including one treated as a mask. The Responses endpoint has an unknown maximum pulled from chat context.
You can anticipate that the technology does not "edit": it always regenerates a new image from a downsized copy of the input, and OpenAI continues to avoid disclosing this directly.
(potential cost: $0.3247, 10 four-tile images)
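A sketch of the per-input-image cost based on the resize-and-tile rule described above (the function is mine; the 512px/2048px resize behavior, 65-token base, and 129 tokens per tile are the figures from this section):

```python
import math

IMAGE_INPUT_PRICE_PER_TOKEN = 10.00 / 1_000_000  # $10 per 1M image input tokens (uncached)

def edit_input_image_cost(width: int, height: int) -> float:
    """Dollar cost for one input image on the edits endpoint, per the rules above."""
    # Downscale (never upscale) so the shortest side is at most 512px...
    scale = min(1.0, 512 / min(width, height))
    w, h = width * scale, height * scale
    # ...then cap the longest side at 2048px.
    scale = min(1.0, 2048 / max(w, h))
    w, h = math.ceil(w * scale), math.ceil(h * scale)
    # 65 base tokens plus 129 tokens per 512x512 tile.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return (65 + 129 * tiles) * IMAGE_INPUT_PRICE_PER_TOKEN

# e.g. edit_input_image_cost(1080, 1920) -> 2 tiles, 323 tokens, ~$0.00323
```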
Prompt rewriting
- This mandatory DALL-E 3 "feature" is said to extend to gpt-image-1, performed by gpt-4.1. If so, could it reduce your carefully written long prompt to distilled language?
- It seems you are not billed separately for the rewriting: you are billed for what you send, perhaps not for what the image model actually receives.
Responses
This is where costs pile up
- You pay again for vision and text to the chat model
- You pay for the internal tool specification tokens
- You pay for vision on the tool return
- You pay for additional instructions added with a tool return
- You pay for the AI calling the tool and writing prompt JSON
- You pay for the vision and prompt input twice when the chat AI is called again with tool results
- Any additional text annotations for image file IDs, which may be needed for providing download links or for pointing later tool calls at previous images, are more tokens
Potential cost: multiplied, compounding with unfettered chat length
1. Vision input to the chat AI.
- gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o3
Image input costs (generalized):
Image size | gpt-4o | gpt-4.1 | o3 | gpt-4.1-mini |
---|---|---|---|---|
1024x1024 | $0.0019125 | $0.00153 | $0.00675 | $0.0006636 |
Let's give the price per 1,000 images for readability in comparison.
Image size | gpt-4o | gpt-4.1 | o3 | gpt-4.1-mini |
---|---|---|---|---|
1024x1024 | $1.9125 | $1.53 | $6.75 | $0.6636 |
1080x1920 | $2.7625 | $2.21 | $9.75 | $0.9772 |
1536x1024 | $2.7625 | $2.21 | $9.75 | $0.9956 |
1600x400 | $1.9125 | $1.53 | $6.75 | $0.4212 |
total of 4 | $9.35 | $7.48 | $33.0 | $3.0576 |
Plus you pay for the chat text, and pay for that text to be seen again, beyond it merely being a prompt to a tool.
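For reference, the gpt-4o and gpt-4.1 columns above follow the standard high-detail vision accounting: 85 base tokens plus 170 per 512px tile, after the image is fit within 2048x2048 and downscaled so the shortest side is at most 768px. A sketch reproducing those token counts (o3 and gpt-4.1-mini use different accounting and are not covered here):

```python
import math

def chat_vision_tokens(width: int, height: int) -> int:
    """High-detail vision token count for gpt-4o / gpt-4.1 class models."""
    # Fit within 2048x2048 (downscale only)...
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # ...then downscale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 85 base tokens plus 170 per 512x512 tile.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

# 1024x1024 -> 765 tokens: $0.0019125 at gpt-4o's $2.50/1M, $0.00153 at gpt-4.1's $2.00/1M
```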
2. Mandatory partial images with streaming
- The cheapest image just went from 272 to 372 tokens if you are typical and use stream:true to provide a good user experience. Have some blurry grey.
3. Context loading from chat context
ChatGPT, which informs us here, shows that the AI doesn't even have to write anything in the prompt field of a tool call to get an image based on the user input. The call can act as a mere trigger, though the AI is still encouraged to write a prompt.
Thus, a large or even full chat context is automatically passed to the image model, not just the tool call JSON. Even after sending many more images, you can get content policy blocks from a single image far back in the conversation (making the whole chat worthless), showing that many images and chat turns are seen by the image_gen tool.
How much of a full chat is sent in Responses, then? All the images? The store parameter being mandatory indicates this same technology is used on the API, so an entire chat being fed into the image AI model is possible. But we don't know.
Conclusion
OpenAI must make a full disclosure of the Responses endpoint image tool technology: how much is actually passed into the tool and billed for tool use, and, when the tool return includes images for vision that persist, how long that persistence lasts.
The full cost of any API call must be calculable, not intolerable.
Addendum: internal tool description of Responses: +255 tokens, non-varying whether you specify size, quality, and background or leave them at "auto", and also non-varying with input images.
```
# Tools

## image_gen

// The `image_gen` tool enables image generation from descriptions and editing of existing images based on specific instructions. Use it when:
// - The user requests an image based on a scene description, such as a diagram, portrait, comic, meme, or any other visual.
// - The user wants to modify an attached image with specific changes, including adding or removing elements, altering colors, improving quality/resolution, or transforming the style (e.g., cartoon, oil painting).
// Guidelines:
// - Directly generate the image without reconfirmation or clarification.
// - After each image generation, do not mention anything related to download. Do not summarize the image. Do not ask followup question. Do not say ANYTHING after you generate an image.
// - Always use this tool for image editing unless the user explicitly requests otherwise. Do not use the `python` tool for image editing unless specifically instructed.
// - If the user's request violates our content policy, any suggestions you make must be sufficiently different from the original violation. Clearly state the reason for refusal and distinguish your suggestion from the original intent in the `refusal_reason` field.
namespace image_gen {

type imagegen = (_: {
  prompt?: string,
}) => any;

} // namespace image_gen
```
Why is this telling YOUR application how to behave?
internal tool description of ChatGPT: 392 tokens *
```
## image_gen

// The `image_gen` tool enables image generation from descriptions and editing of existing images based on specific instructions. Use it when:
// - The user requests an image based on a scene description, such as a diagram, portrait, comic, meme, or any other visual.
// - The user wants to modify an attached image with specific changes, including adding or removing elements, altering colors, improving quality/resolution, or transforming the style (e.g., cartoon, oil painting).
// Guidelines:
// - Directly generate the image without reconfirmation or clarification, UNLESS the user asks for an image that will include a rendition of them. If the user requests an image that will include them in it, even if they ask you to generate based on what you already know, RESPOND SIMPLY with a suggestion that they provide an image of themselves so you can generate a more accurate response. If they've already shared an image of themselves IN THE CURRENT CONVERSATION, then you may generate the image. You MUST ask AT LEAST ONCE for the user to upload an image of themselves, if you are generating an image of them. This is VERY IMPORTANT -- do it with a natural clarifying question.
// - After each image generation, do not mention anything related to download. Do not summarize the image. Do not ask followup question. Do not say ANYTHING after you generate an image.
// - Always use this tool for image editing unless the user explicitly requests otherwise. Do not use the `python` tool for image editing unless specifically instructed.
// - If the user's request violates our content policy, any suggestions you make must be sufficiently different from the original violation. Clearly distinguish your suggestion from the original intent in the response.
namespace image_gen {

type text2im = (_: {
  prompt?: string,
  size?: string,
  n?: number,
  transparent_background?: boolean,
  referenced_image_ids?: string[],
}) => any;

} // namespace image_gen
```
A system message is injected after image tool returns to stop the AI from functioning further (like those injected after file_search or the web browser tool that take the AI over from you):
```
GPT-4o returned 1 images. From now on, do not say or show ANYTHING. Please end this turn now. I repeat: From now on, do not say or show ANYTHING. Please end this turn now. Do not summarize the image. Do not ask followup question. Just end the turn and do not do anything else.
```
Addendum: calculations attempting to quantify technology costs beyond a table.
Here is Python code: functions that return the output token cost of image generation at a given quality and dimensions, also extrapolating what other resolutions might cost.
Generation cost utilities
```python
import math

PRICE_PER_TOKEN = 0.00004  # USD
PATCH_BASE = 64            # px at low quality
QF = {'low': 1, 'medium': 2, 'high': 4}

def tokens_needed(w, h, quality):  # w, h in px; quality 'low'/'medium'/'high' or 1/2/4
    qf = QF[quality] if isinstance(quality, str) else int(quality)
    patch = PATCH_BASE // qf
    rows = math.ceil(h / patch)
    cols = math.ceil(w / patch)
    return rows * (cols + 1)

def cost_usd(w, h, quality):
    return tokens_needed(w, h, quality) * PRICE_PER_TOKEN
```
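As a quick check, this reproduces the output token table from section 1:

```python
assert tokens_needed(1024, 1024, 'high') == 4160
assert tokens_needed(1536, 1024, 'medium') == 1568
print(round(cost_usd(1024, 1536, 'low'), 5))  # ~0.01632
```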
A full utility for image generation costs
Vibe-coded from a chain of this kind of data input:
```python
"""
Cost-estimator utilities for gpt-image-1 output-image tokens.

These helpers are self-contained: they lazy-import anything they need,
use no global state beyond private module constants, and can therefore be
dropped into any codebase or REPL without ceremony.
"""
from __future__ import annotations  # forward references in type hints

from typing import Union, Literal

# ---------------------------------------------------------------------------
# PRIVATE CONSTANTS - tweak only if the vendor changes its published figures.
# ---------------------------------------------------------------------------
_PATCH_BASE_PX: int = 64               # patch edge at low quality
_PRICE_PER_TOKEN_USD: float = 0.00004  # $40 / 1,000,000 tokens

# The three "official" output resolutions (width, height). For production
# safety we refuse everything else, but devs can loosen the rule by editing
# this set or deleting the _validate_resolution() call.
_KNOWN_RESOLUTIONS: set[tuple[int, int]] = {
    (1024, 1024),
    (1024, 1536),
    (1536, 1024),
}

# Mapping from vendor labels to numeric quality factors (qf)
_QUALITY_MAP: dict[str, int] = {"low": 1, "medium": 2, "high": 4}

# Re-usable type alias for call signatures
QualityArg = Union[int, Literal["low", "medium", "high"]]


# ---------------------------------------------------------------------------
# INTERNAL HELPERS
# ---------------------------------------------------------------------------
def _lazy_math() -> "math":
    """Import math on first use (keeps import-time footprint at zero)."""
    import importlib
    return importlib.import_module("math")


def _normalize_quality(quality: QualityArg) -> int:
    """
    Convert quality to its numeric quality-factor (1, 2 or 4).

    Raises
    ------
    ValueError
        If *quality* is neither a recognised string nor 1/2/4.
    """
    if isinstance(quality, str):
        if quality not in _QUALITY_MAP:
            raise ValueError(f"Unknown quality string: {quality!r}")
        return _QUALITY_MAP[quality]
    if quality in (1, 2, 4):
        return int(quality)
    raise ValueError(
        "Quality must be one of {'low','medium','high'} "
        "or the integers 1, 2, 4."
    )


def _validate_resolution(width_px: int, height_px: int) -> None:
    """
    Guard against unexpected dimensions.

    Comment out or adjust the check below if you want to estimate costs
    for experimental resolutions.
    """
    if (width_px, height_px) not in _KNOWN_RESOLUTIONS:
        raise ValueError(
            f"Resolution {width_px}x{height_px} is not in the allowed set "
            f"{sorted(_KNOWN_RESOLUTIONS)}. Edit _KNOWN_RESOLUTIONS or "
            "remove this call to permit arbitrary sizes."
        )


# ---------------------------------------------------------------------------
# PUBLIC API
# ---------------------------------------------------------------------------
def tokens_needed(
    width_px: int,
    height_px: int,
    quality: QualityArg,
) -> int:
    """
    Compute the output-image token count for gpt-image-1.

    Args:
        width_px (int): Width of the requested image in pixels.
        height_px (int): Height of the requested image in pixels.
        quality (str | int):
            One of ``'low'``, ``'medium'``, ``'high'`` **or**
            their numeric quality-factors 1, 2, 4.

    Returns:
        int: Total output-image tokens.

    Raises:
        ValueError: If *quality* or resolution is invalid.

    Notes
    -----
    The rule is::

        patch_size_px = 64 / quality_factor
        rows   = ceil(height / patch_size_px)
        cols   = ceil(width / patch_size_px)
        tokens = rows * (cols + 1)

    where the *+1* term models a single "row-header" token.
    """
    # Fast fail for unsupported resolutions (edit or remove to relax).
    _validate_resolution(width_px, height_px)

    # Turn 'low' | 'medium' | 'high' into 1 | 2 | 4
    qf = _normalize_quality(quality)

    # Lazy-load math only now; keeps import footprint zero.
    math = _lazy_math()

    patch_px = _PATCH_BASE_PX // qf  # integer division
    rows = math.ceil(height_px / patch_px)
    cols = math.ceil(width_px / patch_px)
    return rows * (cols + 1)


def cost_usd(
    width_px: int,
    height_px: int,
    quality: QualityArg,
) -> float:
    """
    Dollar cost for generating an image, given vendor pricing.

    Args:
        width_px (int): Image width in pixels.
        height_px (int): Image height in pixels.
        quality (str | int): ``'low'``/``1``, ``'medium'``/``2``, ``'high'``/``4``.

    Returns:
        float: Cost in US dollars.

    Example
    -------
    >>> cost_usd(1024, 1536, 'medium')
    0.06336
    """
    tokens = tokens_needed(width_px, height_px, quality)
    return tokens * _PRICE_PER_TOKEN_USD
```
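A brief usage example under the same assumptions, printing tokens and cost for the three official resolutions at each quality:

```python
if __name__ == "__main__":
    for (w, h) in sorted(_KNOWN_RESOLUTIONS):
        for q in ("low", "medium", "high"):
            print(f"{w}x{h} {q:>6}: {tokens_needed(w, h, q):>5} tokens, ${cost_usd(w, h, q):.5f}")
```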