I’ve been working with the Responses API and have run into something that has me completely stumped. I’m seeing my input token counts balloon to massive numbers when using the o3 model with web search, and I’m hoping someone here might have some insight into what’s going on.
Here’s a breakdown of the specific API call:
1. My Request: The request is part of an ongoing multi-turn conversation. The setup is as follows:
   - The conversation history consists of 5 previous turns.
   - Those first 5 turns were handled by the gpt-4.1 model (with web_search enabled, no reasoning).
   - For the current prompt, I switched to the o3 model for the first time in this conversation and enabled web_search (with medium reasoning effort).
   - I’ve double-checked the complete input being sent in this call (the 5 previous turns + my new prompt + the system prompt) with the tokenizer, and it comes out to a clean 4,981 tokens.
2. Model’s Internal Steps: During the call, the o3 model made 14 reasoning calls and 13 web search calls to generate the answer. I totally get that reasoning tokens and the context from web searches will add to the input count, so I was expecting the final number to be higher.
3. The Usage Summary: This is the part that I can’t wrap my head around. Here’s what the usage summary looked like for this single call:
"usage": {
"input_tokens": 243551,
"input_tokens_details": {
"cached_tokens": 194748
},
"output_tokens": 2911,
"output_tokens_details": {
"reasoning_tokens": 2048
},
"total_tokens": 246462
}
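To be fair, the cached portion bills at a discount, so the headline number overstates the cost somewhat. Here's the arithmetic (the per-million rates below are placeholders, not o3's actual pricing; substitute the current rates):

```python
usage = {
    "input_tokens": 243551,
    "input_tokens_details": {"cached_tokens": 194748},
    "output_tokens": 2911,
    "output_tokens_details": {"reasoning_tokens": 2048},
    "total_tokens": 246462,
}

cached = usage["input_tokens_details"]["cached_tokens"]
uncached = usage["input_tokens"] - cached
print(uncached)  # 48803 tokens billed at the full input rate

# Placeholder rates in $ per million tokens -- not real o3 pricing.
INPUT_RATE, CACHED_RATE, OUTPUT_RATE = 2.00, 0.50, 8.00
cost = (uncached * INPUT_RATE
        + cached * CACHED_RATE
        + usage["output_tokens"] * OUTPUT_RATE) / 1_000_000
print(f"${cost:.2f}")
```

Even with the cache discount, though, paying for ~49k uncached input tokens on a ~5k prompt is the part I can't explain.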
As you can see, the input tokens jumped to over 243k, which is almost 50 times my original prompt size. This wasn’t a one-off fluke either; I’ve seen this happen multiple times, with one call even hitting 650k input tokens.
- How is it possible for the input tokens to reach 243k? My prompt history was only ~5k and the reasoning tokens were ~2k. It seems the remaining ~236k tokens were added during the process, which feels incredibly high, especially since the prior turns with a different model were normal.
- I’m also trying to understand the cached_tokens. Seeing 194k tokens being cached is interesting, because all the previous turns amounted to less than 4k tokens.
- Could the ~236k extra tokens be coming from the 13 web searches? That would imply the tool is adding a huge amount of raw text from the web pages to the context, which I didn’t think was how it worked.
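One theory that would roughly fit the math: in an agentic tool loop, every reasoning/search iteration re-sends the entire (growing) context, and each iteration is billed as fresh input, so the usage summary reports the sum over all iterations rather than the size of the final context. A toy model of that (the per-search token count is a pure guess on my part):

```python
def cumulative_input_tokens(base, per_search, n_iterations):
    """Model where each iteration of the tool loop re-sends the whole
    context, so billed input is the sum of the growing context sizes."""
    total = 0
    context = base
    for _ in range(n_iterations):
        total += context        # this iteration is billed in full
        context += per_search   # search results get appended to context
    return total

# My numbers: ~5k base prompt, 14 reasoning iterations.
# If each web search injected ~2k tokens of results:
print(cumulative_input_tokens(4_981, 2_000, 14))  # 251734
```

That lands at ~252k, right in the ballpark of my 243k. It would also explain the huge cached_tokens figure: later iterations would re-read the earlier, unchanged prefix from the prompt cache. But this is just my guess at what's happening under the hood.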
Really scratching my head here. Token increase is expected with tools and multi-step reasoning, but a 50x inflation makes it very difficult to predict costs and use the model reliably.
Has anyone else experienced something similar? Any theories on what might be happening under the hood?
Thanks in advance