I’m using gpt-5.2-2025-12-11 via the Chat Completions API (v1/chat/completions) with max_completion_tokens: 300 on a narrative generation call. The expected output is a short paragraph around 105–135 visible output tokens.
My own database records response.usage.completion_tokens on every successful call.
The discrepancy is ~256k tokens that were billed but never recorded on my side. My narrative generator has a retry loop (MAX_RETRIES = 3). On failed attempts the code throws before writing to my database, so failed attempts consume tokens on OpenAI's side but leave no trace in mine. I added a try/catch today, so let's see.
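To illustrate the gap: a minimal sketch of a retry loop that records usage even on failed attempts, assuming the error object carries usage information. `call_model` and the log structure are stand-ins, not the real client or schema:

```python
# Hypothetical sketch: record usage on every attempt, including failures.
# `call_model` simulates the real Chat Completions call; here the first
# two attempts fail but still "consumed" tokens on the provider's side.

MAX_RETRIES = 3

def call_model(prompt, attempt):
    if attempt < 2:
        # A failure that still carries usage data (an assumption here).
        raise RuntimeError("transient failure", {"completion_tokens": 128_000})
    return {"text": "ok", "usage": {"completion_tokens": 120}}

def generate_with_logging(prompt, usage_log):
    for attempt in range(MAX_RETRIES):
        try:
            resp = call_model(prompt, attempt)
            usage_log.append(resp["usage"])   # success path: record usage
            return resp["text"]
        except RuntimeError as exc:
            # Failed attempts consumed tokens too, so record them as well
            # instead of throwing before the database write.
            usage_log.append(exc.args[1])
    raise RuntimeError("all retries failed")

log = []
generate_with_logging("write a paragraph", log)
# log now holds one entry per attempt, including the two failures
```

The point is only that the database write happens in both branches, so billed-but-unrecorded attempts stop being invisible.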
Now my question is: what could possibly generate such a huge output out of nowhere? My prompts are consistent and simple. A chat with ChatGPT suggested reasoning tokens could be the issue. I understand that reasoning increases the cost, but this is a huge spike. Is it possible that the reasoning somehow entered some sort of internal loop? Even the failed attempts have a consistent token count of ~128k, so maybe that is the reasoning limit? I could set reasoning_effort: 'none', but I'm more curious about the actual issue than about limiting my app, since I get good results with this flow apart from this issue.
Has anyone encountered something like this, and if so, what did you do to fix it?
Previously, using Chat Completions with a max_output_tokens too low to receive the entire response would give you no output at all:
OpenAI wrote two days ago: "we have also flagged this to our engineering team and they are looking into this for you."
So the first consideration: OpenAI might be changing behaviors to fix this non-delivery of truncated output, and you should leave plenty of headroom so you always get some kind of response that is not a major recursive malfunction (such as a nonstop loop of tokens).
However, I observe that the bad behavior continues, so we can somewhat discount "engineering delivering a fix": what should be partial output is still not received, and is billed only as "reasoning" in usage, even with reasoning_effort set to none on the model in question:
input tokens: 168
output tokens: 100
uncached: 168
non-reasoning: 0
cached: 0
reasoning: 100
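The pattern in the figures above can be checked mechanically: every billed output token was a reasoning token. A small sketch, with field names of my own choosing (the real usage object's keys may differ):

```python
def reasoning_share(usage):
    """Fraction of billed output tokens that were reasoning tokens."""
    out = usage["output_tokens"]
    return usage["reasoning_tokens"] / out if out else 0.0

# The figures shown above, as a dict (labels are my own, not the API's):
usage = {"input_tokens": 168, "output_tokens": 100,
         "reasoning_tokens": 100, "non_reasoning_tokens": 0}

# Output should decompose into reasoning + non-reasoning tokens.
assert usage["output_tokens"] == (usage["reasoning_tokens"]
                                  + usage["non_reasoning_tokens"])
print(reasoning_share(usage))  # 1.0 -> the entire output was reasoning
```

A share of 1.0 with reasoning_effort at none is exactly the masking anomaly described below.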
You say that your expected output is under 135 tokens and your max budget is 300, yet the usage log you show bills you far more than that as output tokens, throughout its entirety. This parameter should absolutely cap the billing at what is specified, on top of the previous behavior where you'd get nothing out of the model. The fault may be that the reasoning OpenAI tries to mask in usage reports, which happens even at "none", is being billed and shows up when calling the usage endpoint. That's the first discrepancy I note.
The amounts of output tokens in your top two API calls are also more than the model allows in its entirety. 128,000 (125 × 1024) is the model's maximum output, which should include internal reasoning generation billed as output. You have calls that exceed that by 26 tokens and by 341 tokens. This is the second discrepancy: a billing larger than what should even be possible.
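The arithmetic behind that second discrepancy, spelled out (the 128,000 ceiling and the two billed figures are taken from the discussion above):

```python
# Sanity check: billed output tokens against the model's documented
# 128,000-token output ceiling.
MODEL_MAX_OUTPUT = 128_000

for billed in (128_026, 128_341):   # the two spike calls
    excess = billed - MODEL_MAX_OUTPUT
    print(f"{billed} exceeds the ceiling by {excess} tokens")
```

Any positive excess here means the bill reports more output than the model can physically emit in one call.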
Then there is the source of the fault itself. It would be extremely unlikely for the model to generate precisely enough massive reasoning to get near the limit and still be able to transition to an output. If there is any malfunction actually related to token generation by the model, rather than a complete fault in the billing itself, it would be the AI continuing with repetitions and more output when it should have stopped. This symptom is seen primarily on the Responses API, while Chat Completions seems largely immune to it. However, it could be a change where a stop sequence no longer actually stops the AI output after you've received it. That's the first scenario.
The second scenario is simply that they messed up the billing calculations in recent model and API modifications.
So: it should be essentially impossible for you to get such a billing. For analysis, if you go to the platform site's usage page, the dollar figures will be conflated with that many calls being made daily, but you should be able to immediately identify whether your billing for a one-day period includes two calls with 500x the expected output billing.
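A crude filter for that identification step, assuming you can pull per-call or per-bucket output token counts out of the usage data (the threshold multiplier is arbitrary; any call over the 300-token budget is already anomalous):

```python
# Flag any output figure wildly above the expected per-call budget.
EXPECTED_MAX = 300   # the max_completion_tokens set on the call
FACTOR = 10          # anything 10x the budget is clearly a spike

def flag_spikes(output_token_counts):
    return [n for n in output_token_counts if n > FACTOR * EXPECTED_MAX]

print(flag_spikes([120, 131, 128_026, 110, 128_341]))
# -> [128026, 128341]
```

Against the numbers in this thread, only the two ~128k calls survive the filter; every well-behaved call falls far below it.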
(Another fault is in the "usage" site itself: after clicking "chat completions" to view a daily bill per individual model, OpenAI continues not to properly deliver output tokens; only input token graphs are shown, regardless of any setting on the site.)
Can you clarify: is the spreadsheet you show data you captured from each API call's "usage" object, or the result of calling the organization admin API for usage with an admin API key? Where should I look for the massive output tokens when I make my own calls? A call with max_output_tokens at 4000 got me my output without excessive tokens in "usage".
The export is from OpenAI Dashboard → Usage. I selected only the day that had the spike and exported by minute, then simply ordered by output tokens descending.
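That sort step can be reproduced on the exported file with nothing but the standard library. The column names below are assumptions; a real dashboard export may use different headers:

```python
import csv
import io

# Simulated per-minute usage export (column names are placeholders).
export = """timestamp,model,input_tokens,output_tokens
2025-12-11T10:01,gpt-5.2,168,128341
2025-12-11T10:02,gpt-5.2,168,120
2025-12-11T10:03,gpt-5.2,168,128026
"""

rows = list(csv.DictReader(io.StringIO(export)))
# Order by output tokens descending, as described above.
rows.sort(key=lambda r: int(r["output_tokens"]), reverse=True)
print([r["output_tokens"] for r in rows])  # ['128341', '128026', '120']
```

With a real export you would pass the file handle to `csv.DictReader` instead of the `io.StringIO` wrapper.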
Lots of API calls attempting to replicate the symptom could quickly confound the results if not limited to under one per minute. In either the export from "usage" or the "usage" admin API endpoint, you only get "buckets": at most one-minute totals per model, and no request IDs.
So with minimum-effort replication: not an immediate, 100%-reproducible fault.
I would capture the request ID, which is delivered in the API call's response headers, and report it to OpenAI for investigation of model and API behavior if this overbilling rears its head again.
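For capturing that ID: the openai Python client exposes raw response headers via its `with_raw_response` accessor, and the request ID arrives in the `x-request-id` header. A sketch with simulated headers so it runs without a live call:

```python
# With a live call you would do something like (not executed here):
#   raw = client.chat.completions.with_raw_response.create(...)
#   request_id = raw.headers.get("x-request-id")

def extract_request_id(headers):
    """Pull the request ID out of a headers mapping.

    HTTP header names are case-insensitive, so normalise before lookup
    (real HTTP client header objects usually do this for you).
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("x-request-id")

headers = {"Content-Type": "application/json", "x-request-id": "req_abc123"}
print(extract_request_id(headers))  # req_abc123
```

Logging that value alongside each usage record gives OpenAI support something concrete to trace, which the minute-granularity usage buckets cannot provide.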