Expected:
- user gets all the output intended for them, up to truncation point by output limit.
Issue:
- When a user-facing response is in any way incomplete, the entire response is undelivered, and all the generation is billed as “reasoning_tokens”
Symptoms:
- The AI might reason much longer by developer message (somewhat expected) or rather, token billing bins are wrong,
- or, untruthful token counts are delivered that masks this reasoning when a response is complete:
- “reasoning_tokens” billing behavior has recently changed, where before 8-128 was typical billing on minimal.
Examples - nonstreaming
max_completion_tokens: 400
Bills 400 of 400 as “reasoning”
No output.
CompletionUsage(completion_tokens=400, prompt_tokens=132, total_tokens=532, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=400, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
max_completion_tokens: 600
Bills 600 of 600 as “reasoning”
No output
CompletionUsage(completion_tokens=600, prompt_tokens=132, total_tokens=732, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=600, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
(this case can also sometimes complete)
max_completion_tokens: 700
- Bills 0 as “reasoning”
- Bills
508tokens as completion - DELIVERS THE OUTPUT
- 499 actual content tokens delivered
assistant output
A checkerboard PNG is surprisingly versatile. Common applications include:
- Graphics editing: As a transparency background in image editors, or as a test layer to judge alpha blending, edges, and halos.
- Texture and material testing: In 3D and game engines to check UV mapping, scale, stretching, and seams; also to diagnose missing textures.
- Camera and lens calibration: Specially designed checkerboards (high-contrast, precise squares) are used to calibrate intrinsic/extrinsic camera parameters, distortion, and stereo rigs.
- Vision and robotics: For pose estimation, homography, rectification, and geometric benchmarking in computer vision pipelines.
- Rendering/photography tests: Evaluate moiré, aliasing, sharpness, dynamic range, exposure, and compression artifacts.
- Display calibration: Check monitor uniformity, contrast, gamma behavior, viewing-angle shifts, and pixel response; identify stuck/dead pixels.
- Print and scanning QA: Assess printer/scanner resolution, dot gain, registration, alignment, and de-screening performance.
- UI/UX prototyping: Placeholder background to reveal transparency and stacking issues in UI elements or web components.
- Web and CSS development: Tiled backgrounds to test responsive behavior, sprite alignment, subpixel rendering, and image scaling algorithms.
- Game development: Placeholder tiles for prototyping levels; quick visual grid for collision and alignment debugging.
- AR markers and tracking: Simple fiducial-like patterns (or as part of AprilTag/ArUco-style targets) for robust detection.
- Mathematical/educational visuals: Demonstrate sampling theory, Nyquist/aliasing, coordinate systems, parity, and tiling concepts.
- Optical experiments: Assess lens chromatic aberration, bokeh edge behavior, and depth-of-field transitions using high-contrast edges.
- CNC/laser/plotter alignment: Physical prints to align axes, calibrate steps-per-unit, and check distortion across a bed.
- Quality control in imaging pipelines: Baseline pattern to compare color correction, sharpening, denoising, and resampling differences.
- Artistic/branding uses: Backgrounds, patterns, or motifs where the checker aesthetic is desired.
Tips:
- Use high-contrast, uniformly sized squares and known dimensions when calibrating or measuring.
- Prefer lossless formats (PNG) to avoid compression artifacts that can bias tests.
- For camera calibration, generate a board with exact square size and print quality; record the physical square size for accurate scale.
CompletionUsage(completion_tokens=508, prompt_tokens=132, total_tokens=640, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
Conclusion:
- The AI is generating output to be seen by the user
- OpenAI is not delivering the product paid for that is useful
- max_completion_tokens is completely discarding for “length”.
- OpenAI is billing for the product intended to be seen.
Reproduction case
Text input only to reproduce:
"""minimal reasoning with token cutoff well after output"""
messages = [
{
"role": "developer",
"content": [
{"type": "text", "text": """
You a problem-solving puzzle-finding AI model with advanced reasoning, expertise, and knowledge.
Think like an engineer: always looking for the real correct answer, and inferring from specifications automatically.
Consider every input by a user as not straightforward query, but perhaps having a deeper need.
You aim to solve the ultimate need that might come about from a misdirected path of inquiry.
You can do your own internal discovery, reflecting on how you must start over in a path of inquiry, to then expound.
Your own thinking time is not limited.""".strip()},
],
},
{
"role": "user",
"content": [
{"type": "text", "text": "What applications would a checkerboard image PNG be useful for?"},
],
},
]
chat_completions_parameters = {
"model": "gpt-5",
"messages": messages,
"max_completion_tokens": 400,
"verbosity": "medium",
"reasoning_effort": "minimal",
}
# Send the request and receive the response
print(f"--- Testing")
import openai
client = openai.Client()
response = client.chat.completions.create(**chat_completions_parameters)
print(response.choices[0].message.content)
print(response.usage)
Action needed
- deliver all user output
- bill accurately per internal container destination
- do not obfuscate reasoning amount
- describe the amount of reasoning done and billed and capped, even if internally for OpenAI’s benefit.