Massive Input Token Inflation (50x) with o3 + web_search

I’ve been working with the Responses API and have run into something that has me completely stumped. I’m seeing my input token counts balloon to massive numbers when using the o3 model with web search, and I’m hoping someone here might have some insight into what’s going on.

Here’s a breakdown of the specific API call:

1. My Request: The request is part of an ongoing multi-turn conversation. Here’s the setup:
  • The conversation history consists of 5 previous turns.

  • Those first 5 turns were handled by the gpt-4.1 model (with web_search enabled, no reasoning).

  • For the current prompt, I switched to the o3 model for the first time in this conversation and enabled web_search (and medium reasoning effort).

  • I’ve double-checked the complete input being sent in this call (the 5 previous turns + my new prompt + the system prompt) with the tokenizer, and it comes out to a clean 4,981 tokens.
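In case it helps, here’s roughly how I’m constructing the request. This is a sketch, not my exact code: the field names follow my reading of the Responses API docs, and the actual message contents are elided.

```python
# Sketch of the request payload for the turn in question.
# Field names follow my understanding of the Responses API;
# the actual turn contents are placeholders.
request = {
    "model": "o3",
    "reasoning": {"effort": "medium"},
    "tools": [{"type": "web_search"}],
    "input": [
        {"role": "system", "content": "(system prompt)"},
        # ... the 5 previous gpt-4.1 turns go here ...
        {"role": "user", "content": "(new prompt)"},
    ],
}
# response = client.responses.create(**request)
```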

2. Model’s Internal Steps: During the call, the o3 model made 14 reasoning calls and 13 web search calls to generate the answer. I totally get that reasoning tokens and the context from web searches will add to the input count, so I was expecting the final number to be higher.

3. The Usage Summary: This is the part that I can’t wrap my head around. Here’s what the usage summary looked like for this single call:

"usage": {
      "input_tokens": 243551,
      "input_tokens_details": {
        "cached_tokens": 194748
      },
      "output_tokens": 2911,
      "output_tokens_details": {
        "reasoning_tokens": 2048
      },
      "total_tokens": 246462
}

As you can see, the input tokens jumped to over 243k, which is almost 50 times my original prompt size. This wasn’t a one-off fluke either; I’ve seen this happen multiple times, with one call even hitting 650k input tokens.
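To make the gap concrete, here’s the arithmetic on that usage block (token figures copied straight from the JSON above):

```python
# Token arithmetic from the usage block above.
input_tokens = 243_551
cached_tokens = 194_748
prompt_tokens = 4_981    # my measured history + new prompt + system prompt

uncached = input_tokens - cached_tokens     # billed at the full input rate
unexplained = input_tokens - prompt_tokens  # tokens I can't account for

print(uncached)     # 48803
print(unexplained)  # 238570
```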

  1. How is it possible for the input tokens to reach 243k? My prompt history was only ~5k and the reasoning tokens were ~2k. It seems the remaining ~236k tokens were added during the process, which feels incredibly high, especially since the prior turns with a different model were normal.
  2. I’m also trying to understand the cached_tokens. Seeing 194k tokens being cached is interesting, because all the previous turns amounted to less than 4k tokens.
  3. Could the ~236k extra tokens be coming from the 13 web searches? That would imply the tool is adding a huge amount of raw text from the web pages to the context, which I didn’t think was how it worked.

Really scratching my head here. Token increase is expected with tools and multi-step reasoning, but a 50x inflation makes it very difficult to predict costs and use the model reliably.

Has anyone else experienced something similar? Any theories on what might be happening under the hood?

Thanks in advance


Remember that each of those returned web pages is being scraped and ingested by the model as additional input context.

Then you have iteration toward a result, with each call re-sending your accumulated context.

The cache hit ratio suggests you are iterating many times over the same, or steadily expanding, context.
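The compounding can be sketched numerically. Assuming (purely for illustration) ~6k tokens ingested per search result, the billed input is the sum of the growing context across all 14 calls, not the final context size:

```python
# Illustrative model of iterative tool use: every model call re-sends
# the whole (growing) context, so billed input is the SUM of context
# sizes across calls, not the final size. All numbers are assumptions.
base_context = 5_000   # initial history + prompt (roughly the OP's 4,981)
page_tokens = 6_000    # assumed tokens ingested per web search result
calls = 14             # model/reasoning calls in the run
searches = 13          # web searches feeding results back into context

billed_input = 0
context = base_context
for i in range(calls):
    billed_input += context     # full context re-sent on every call
    if i < searches:
        context += page_tokens  # search result appended for the next call

print(billed_input)  # 616000 with these made-up numbers
```

With these invented figures you land in the same few-hundred-thousand-token range as the 243k (and 650k) calls, without any single step being outlandish.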


@tom_osullivan Thanks a lot for chiming in! That’s a plausible explanation, and it connects the dots perfectly.

Are you sure that’s how web_search actually works under the hood, or are you making an educated guess based on the token inflation numbers I posted? Either way, your logic is sound, but if that is the actual implementation, it feels incredibly inefficient and doesn’t make sense financially.

I had assumed OpenAI uses an internal model to extract relevant snippets before adding them to the context. The idea that we are being billed, repeatedly, for ingesting entire raw web pages is concerning. Paying for 243k tokens on a single query, most of it un-curated web content, would make this prohibitively expensive for me.
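To put a dollar figure on it: using illustrative per-million-token rates (my assumptions, not quotes from the official pricing page), the usage block above bills out roughly like this:

```python
# Cost implied by the usage block, at ASSUMED per-1M-token rates.
input_rate = 2.00    # $/1M uncached input tokens (assumption)
cached_rate = 0.50   # $/1M cached input tokens (assumption)
output_rate = 8.00   # $/1M output tokens (assumption)

cost = ((243_551 - 194_748) * input_rate
        + 194_748 * cached_rate
        + 2_911 * output_rate) / 1_000_000
print(f"${cost:.2f}")  # about $0.22 for this single call
```

Even if my rates are off, the point stands: the bill is dominated by input tokens I never wrote.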

Just to add: the prompt that produced these massive token numbers was simply “Can you give me latest news from the US for today?” :joy: