GPT-5 on "minimal" - Serious anomaly in reasoning-token billing and output delivery failure

Expected:

  • The user receives all output generated for them, up to the truncation point imposed by the output limit.

Issue:

  • When a user-facing response is in any way incomplete, the entire response goes undelivered, and all of the generation is billed as “reasoning_tokens”.

Symptoms:

  • Either the model reasons much longer when prompted by the developer message (somewhat expected), or the token-billing bins are wrong;
  • or inaccurate token counts are reported that mask this reasoning when a response is complete;
  • “reasoning_tokens” billing behavior has recently changed: previously, 8–128 reasoning tokens was typical billing on “minimal”.

Examples - non-streaming

max_completion_tokens: 400

Bills 400 of 400 as “reasoning”
No output.

CompletionUsage(completion_tokens=400, prompt_tokens=132, total_tokens=532, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=400, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))

max_completion_tokens: 600

Bills 600 of 600 as “reasoning”
No output

CompletionUsage(completion_tokens=600, prompt_tokens=132, total_tokens=732, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=600, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))

(this case can also sometimes complete)
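The pattern in the truncated runs can be checked mechanically from the usage payload. Below is a small sketch (the helper name `all_billed_as_reasoning` is my own; the figures are copied from the `CompletionUsage` dumps above) that flags a response where the entire completion budget was billed as reasoning and nothing was delivered:

```python
# Usage figures copied from the two truncated runs above.
truncated_runs = [
    {"completion_tokens": 400, "reasoning_tokens": 400, "content": None},
    {"completion_tokens": 600, "reasoning_tokens": 600, "content": None},
]
# Usage figures from the run that completed (max_completion_tokens=700).
completed_run = {"completion_tokens": 508, "reasoning_tokens": 0,
                 "content": "A checkerboard PNG is surprisingly versatile..."}

def all_billed_as_reasoning(run):
    """True when every completion token was billed as reasoning and no content arrived."""
    return (
        run["completion_tokens"] > 0
        and run["reasoning_tokens"] == run["completion_tokens"]
        and not run["content"]
    )

for run in truncated_runs:
    print(all_billed_as_reasoning(run))   # True for both truncated runs
print(all_billed_as_reasoning(completed_run))  # False for the completed run
```

In other words: whenever the budget runs out, reasoning_tokens == completion_tokens and the content is empty; when the budget is large enough, reasoning_tokens drops to 0.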

max_completion_tokens: 700

  • Bills 0 as “reasoning”
  • Bills 508 tokens as completion
  • DELIVERS THE OUTPUT
  • 499 actual content tokens delivered
Assistant output:

A checkerboard PNG is surprisingly versatile. Common applications include:

  • Graphics editing: As a transparency background in image editors, or as a test layer to judge alpha blending, edges, and halos.
  • Texture and material testing: In 3D and game engines to check UV mapping, scale, stretching, and seams; also to diagnose missing textures.
  • Camera and lens calibration: Specially designed checkerboards (high-contrast, precise squares) are used to calibrate intrinsic/extrinsic camera parameters, distortion, and stereo rigs.
  • Vision and robotics: For pose estimation, homography, rectification, and geometric benchmarking in computer vision pipelines.
  • Rendering/photography tests: Evaluate moiré, aliasing, sharpness, dynamic range, exposure, and compression artifacts.
  • Display calibration: Check monitor uniformity, contrast, gamma behavior, viewing-angle shifts, and pixel response; identify stuck/dead pixels.
  • Print and scanning QA: Assess printer/scanner resolution, dot gain, registration, alignment, and de-screening performance.
  • UI/UX prototyping: Placeholder background to reveal transparency and stacking issues in UI elements or web components.
  • Web and CSS development: Tiled backgrounds to test responsive behavior, sprite alignment, subpixel rendering, and image scaling algorithms.
  • Game development: Placeholder tiles for prototyping levels; quick visual grid for collision and alignment debugging.
  • AR markers and tracking: Simple fiducial-like patterns (or as part of AprilTag/ArUco-style targets) for robust detection.
  • Mathematical/educational visuals: Demonstrate sampling theory, Nyquist/aliasing, coordinate systems, parity, and tiling concepts.
  • Optical experiments: Assess lens chromatic aberration, bokeh edge behavior, and depth-of-field transitions using high-contrast edges.
  • CNC/laser/plotter alignment: Physical prints to align axes, calibrate steps-per-unit, and check distortion across a bed.
  • Quality control in imaging pipelines: Baseline pattern to compare color correction, sharpening, denoising, and resampling differences.
  • Artistic/branding uses: Backgrounds, patterns, or motifs where the checker aesthetic is desired.

Tips:

  • Use high-contrast, uniformly sized squares and known dimensions when calibrating or measuring.
  • Prefer lossless formats (PNG) to avoid compression artifacts that can bias tests.
  • For camera calibration, generate a board with exact square size and print quality; record the physical square size for accurate scale.

CompletionUsage(completion_tokens=508, prompt_tokens=132, total_tokens=640, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))

Conclusion:

  • The AI is generating output meant to be seen by the user.
  • OpenAI is not delivering the usable product that was paid for.
  • Output that hits max_completion_tokens is completely discarded with finish_reason “length”.
  • OpenAI is billing for the product that was intended to be seen.
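Until this is fixed, client code can at least detect the failure mode before trusting a response. A hedged sketch (the function name is my own) that uses only fields appearing in the dumps above - `finish_reason`, `message.content`, and `usage.completion_tokens_details.reasoning_tokens`:

```python
from types import SimpleNamespace as NS  # used below to build a stub response

def is_silently_truncated(response) -> bool:
    """Detect the reported failure: finish_reason 'length', no content,
    and the whole completion budget billed as reasoning_tokens."""
    choice = response.choices[0]
    usage = response.usage
    return (
        choice.finish_reason == "length"
        and not choice.message.content
        and usage.completion_tokens_details.reasoning_tokens == usage.completion_tokens
    )

# Demonstration with a stub shaped like the 400-token run above.
stub = NS(
    choices=[NS(finish_reason="length", message=NS(content=None))],
    usage=NS(completion_tokens=400,
             completion_tokens_details=NS(reasoning_tokens=400)),
)
print(is_silently_truncated(stub))  # True
```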

Reproduction case

Text input only to reproduce:

"""minimal reasoning with token cutoff well after output"""
messages = [
    {
        "role": "developer",
        "content": [
            {"type": "text", "text": """
You a problem-solving puzzle-finding AI model with advanced reasoning, expertise, and knowledge.
Think like an engineer: always looking for the real correct answer, and inferring from specifications automatically.
Consider every input by a user as not straightforward query, but perhaps having a deeper need.
You aim to solve the ultimate need that might come about from a misdirected path of inquiry.
You can do your own internal discovery, reflecting on how you must start over in a path of inquiry, to then expound.
Your own thinking time is not limited.""".strip()},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What applications would a checkerboard image PNG be useful for?"},
        ],
    },
]
chat_completions_parameters = {
    "model": "gpt-5",
    "messages": messages,
    "max_completion_tokens": 400,
    "verbosity": "medium",
    "reasoning_effort": "minimal",
}

# Send the request and receive the response
import openai

print("--- Testing")
client = openai.Client()
response = client.chat.completions.create(**chat_completions_parameters)
print(response.choices[0].message.content)
print(response.usage)
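One possible client-side workaround (a sketch, not a fix - the discarded tokens are still billed) is to retry with a larger budget whenever the response comes back empty with finish_reason “length”. The wrapper name and its `make_request` parameter are my own; `make_request` stands in for `client.chat.completions.create`:

```python
def request_with_budget_retry(make_request, params, max_attempts=3, growth=2):
    """Retry an empty, length-truncated response with a grown token budget.

    make_request: callable taking chat-completions keyword arguments and
    returning a response object (e.g. client.chat.completions.create).
    """
    params = dict(params)  # copy so the caller's dict is untouched
    for _ in range(max_attempts):
        response = make_request(**params)
        choice = response.choices[0]
        # Keep any response that has content or stopped for a reason
        # other than hitting the token limit.
        if choice.message.content or choice.finish_reason != "length":
            return response
        # Empty output truncated for length: grow the budget and try again.
        params["max_completion_tokens"] *= growth
    return response
```

With the reproduction case above, this would be called as `request_with_budget_retry(client.chat.completions.create, chat_completions_parameters)`.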

Action needed

  • Deliver all user-facing output.
  • Bill tokens accurately according to their internal container destination.
  • Do not obfuscate the amount of reasoning performed.
  • Report the amount of reasoning done, billed, and capped, even if that accounting is internal and for OpenAI’s benefit.

I think even on “minimal” it will often use reasoning with that prompt (I’ve done similar tests with puzzles). But yes, something is off with your case.
It would be great to have a “thinking budget tokens” parameter like Claude’s, instead of the unified max_completion_tokens.
PS: Try with something that requires a very short answer, like yes/no or a single value; it will be easier to see that it’s reasoning even on “minimal”, because sometimes it will take 30 seconds to return an empty answer → the token budget was exhausted by reasoning (my interpretation).