Responses API high token consumption

Hello All,

Has anyone experienced high token consumption when using the Responses API with gpt-4.1-mini?

In my case, I have a prompt that is 1716 tokens according to the OpenAI tokenizer (see screenshot below).

But the Responses API is charging me 9708 tokens (5578 of them cached):

"usage": {
    "input_tokens": 9708,
    "total_tokens": 9875,
    "output_tokens": 167,
    "input_tokens_details": {
      "cached_tokens": 5578
    },
    "output_tokens_details": {
      "reasoning_tokens": 0
    }
  }

Any idea why there is such a big difference between 1716 and 9708 tokens?

That little snippet in your first image says it all:

file_search_call

You are passing the file_search results into the prompt, which expands the prompt significantly once processed.

Good catch @mat.eo!

I just removed file_search from the tools list and still get 11000 tokens.

In my initial example, there was no file_search_call in the output. See the list of output items I got. The number is the message length in characters.

message 149
message 156
message 157
message 164
function_call check_delivery_status

You’re using function calling, which also expands the prompt.

All features outside of the system message and prompt use additional tokens. It's not accurate to copy and paste the code that prepares your request into a tokenizer.

  1. The tokens counted aren't the JSON of a request like you pasted into the tokenizer; they are the language as seen by the AI, with tool definitions rendered into it (the sketch below shows how to measure that overhead directly).
  2. The cached tokens indicate you are reusing a previous response ID and its prior chat state, where you pay for all the chat turns seen before, including the tool calls.
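
You can measure that expansion directly instead of pasting code into a tokenizer. A minimal sketch, assuming the openai Python SDK's Responses interface; the tool name is borrowed from your output list, but the schema and prompt are illustrative:

# Compare billed input tokens for the same request with and without tools.
# Assumes the openai Python SDK; the tool schema here is illustrative only.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "name": "check_delivery_status",  # name from your output list; schema is made up
    "description": "Look up the delivery status of an order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

for extra in ({}, {"tools": tools}):
    response = client.responses.create(
        model="gpt-4.1-mini",
        input="What is the status of order 12345?",
        store=False,
        **extra,
    )
    label = "with tools" if extra else "no tools"
    print(f"{label}: {response.usage.input_tokens} input tokens")

The difference between the two numbers is what the rendered tool definitions cost you on every call.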

Reset the chat. See the token count drop.

Do not use the Responses API's server-side state by response ID reuse, unless you like paying for the massive chat cost run-up possible on 1M-context input models. OpenAI gives you something unsuitable for untrusted users who aren't paying the bill; manage your own chat history length in coordination with the caching discount and its expiry period.

How can I reset the chat?
By the way, I am not storing responses on the server. I manually manage chat history.
I have the following in my API call payload:

  "previous_response_id": null,
  "store": false,

My guess for these extra tokens is that the Responses API "thinks" about:

  • should it use file_search
  • should it call some function
  • something else???

and that "thinking" eats input tokens.

Reset, meaning: just abandon the chat session and don't reuse any previous response ID, the equivalent of a "start a new chat" button for the user (or what the user must do if you enforce a maximum number of turns as your expected service).

If you are in control of all the messages you send, by not reusing a previous response ID, then you have complete observation and control of everything you are resending as input messages.

All the input you send is billed again, even when continuing a chat session. To reduce your costs, reduce the length of the history you preserve.
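
A minimal sketch of that self-managed trimming, assuming the openai Python SDK; MAX_TURNS, SYSTEM_PROMPT, and the message shapes are illustrative, not from this thread:

# Keep only the most recent turns; everything resent is billed again.
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a helpful support agent."  # illustrative
MAX_TURNS = 10  # user/assistant pairs to keep (illustrative cap)
history: list[dict] = []  # your own store; nothing is kept server-side

def send(user_message: str) -> str:
    # append the new user turn, then resend only the most recent turns
    history.append({"role": "user", "content": user_message})
    trimmed = history[-(MAX_TURNS * 2):]
    response = client.responses.create(
        model="gpt-4.1-mini",
        input=[{"role": "system", "content": SYSTEM_PROMPT}, *trimmed],
        store=False,  # no server-side state; no previous_response_id either
    )
    history.append({"role": "assistant", "content": response.output_text})
    return response.output_text

One trade-off to keep in mind: dropping the oldest turns changes the start of the input, so the next call can lose part of the cache discount described below.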

If you are making quick successive calls, at actual chat speed, then you may receive a cache discount of 50% or 75% on the part of the input, from the start, that is identical to a previous API call's input.

A little off-topic, but in his case, what is the correct way of getting uncached input tokens? Is it "input_tokens": 9708 - "cached_tokens": 5578 = 4130 uncached input tokens? And to calculate the price of that response, do you multiply 4130 by the uncached token price, add 5578 times the cached token price, and add 167 times the output token price? Why doesn't the API naturally return the uncached token amount, or am I calculating it wrong? Thanks.

Subtraction.

Note, in the function below, how uncached is defined internally:
f"uncached: {(uncached := total_prompt_tokens - cached_prompt_tokens)}",

The Responses API has renamed fields vs. Chat Completions, so the function works on the Responses "usage" input values remapped:

input_tokens - cached_tokens

def pretty_usage_table(usage_data: dict, one_line=False) -> str:
    '''Returns printable multi-line table string with usage, optionally a single line
Compatible with Responses or Chat Completions API; prints only useful information

Chat Completions API received usage example:
{
   "completion_tokens": 75,
   "prompt_tokens": 1289,
   "total_tokens": 1364,
   "completion_tokens_details": {
      "audio_tokens": 0,               # portion billed at the higher rate of audio
      "reasoning_tokens": 64,          # portion billed that was unseen reasoning
      "accepted_prediction_tokens": 0, # informational, portion of "prediction" input matched
      "rejected_prediction_tokens": 0  # billed unmatched input, not exclusive of `accepted` and can total more than sent
   },
   "prompt_tokens_details": {
      "audio_tokens": 0,     # portion billed at the higher rate of audio
      "cached_tokens": 1152  # portion discounted by matching prior context window
   }
}

Responses API received usage example:
{
   "input_tokens": 1289,
   "input_tokens_details": {
      "cached_tokens": 0       # portion discounted by matching prior context window
   },
   "output_tokens": 685,
   "output_tokens_details": {
      "reasoning_tokens": 640  # portion billed that was unseen reasoning
   },
   "total_tokens": 1974
}

'''
    import json
    print(f"\nreceived usage:\n{json.dumps(usage_data, indent=3)}\n")  # can comment out after debugging
    
    # process any usage object input to chat completions form
    normalized_usage = {
        key.replace("input_", "prompt_").replace("output_", "completion_"): value
        for key, value in usage_data.items()
    }

    # Totals and detail breakdowns
    total_prompt_tokens = normalized_usage.get("prompt_tokens", 0)
    total_completion_tokens = normalized_usage.get("completion_tokens", 0)

    prompt_detail = normalized_usage.get("prompt_tokens_details", {})
    completion_detail = normalized_usage.get("completion_tokens_details", {})

    cached_prompt_tokens = prompt_detail.get("cached_tokens", 0)
    audio_prompt_tokens = prompt_detail.get("audio_tokens", 0)

    reasoning_completion_tokens = completion_detail.get("reasoning_tokens", 0)
    audio_completion_tokens = completion_detail.get("audio_tokens", 0)

    # Prepare columns with intermediate assignments via walrus
    prompt_column: list[str] = [
        f"input tokens: {total_prompt_tokens}",
        f"uncached: {(uncached := total_prompt_tokens - cached_prompt_tokens)}",
        f"cached: {cached_prompt_tokens}",
    ]
    completion_column: list[str] = [
        f"output tokens: {total_completion_tokens}",
        f"non-reasoning: {(nonreasoning := total_completion_tokens - reasoning_completion_tokens)}",
        f"reasoning: {reasoning_completion_tokens}",
    ]

    # Include audio breakdown if present
    if audio_prompt_tokens or audio_completion_tokens:
        prompt_column.append(f"non-audio: {(nonaudio_prompt := total_prompt_tokens - audio_prompt_tokens)}")
        prompt_column.append(f"audio: {audio_prompt_tokens}")
        completion_column.append(f"non-audio: {(nonaudio_completion := total_completion_tokens - audio_completion_tokens)}")
        completion_column.append(f"audio: {audio_completion_tokens}")

    # Determine column widths
    prompt_width = max(len(cell) for cell in prompt_column)
    completion_width = max(len(cell) for cell in completion_column)

    # Build table lines
    table_lines: list[str] = []
    table_lines.append(f"| {'-' * prompt_width} | {'-' * completion_width} |")
    table_lines.append(f"| {prompt_column[0].ljust(prompt_width)} | {completion_column[0].ljust(completion_width)} |")
    table_lines.append(f"| {'-' * prompt_width} | {'-' * completion_width} |")
    for left_cell, right_cell in zip(prompt_column[1:], completion_column[1:]):
        table_lines.append(f"| {left_cell.ljust(prompt_width)} | {right_cell.ljust(completion_width)} |")

    # One-line summary uses the earlier assignments
    prompt_audio_str = f", audio {audio_prompt_tokens}" if audio_prompt_tokens else ""
    completion_audio_str = f", audio {audio_completion_tokens}" if audio_completion_tokens else ""
    single_line = (
        f"input: {total_prompt_tokens} (uncached {uncached}, cached {cached_prompt_tokens}{prompt_audio_str}); "
        f"output: {total_completion_tokens} (non-reasoning {nonreasoning}, "
        f"reasoning {reasoning_completion_tokens}{completion_audio_str})"
    )

    return "\n" + single_line if one_line else "\n" + "\n".join(table_lines)

Near the top, I added a print line so you can see the usage input to the function.

The function delivers a string to print; it also works as documentation.
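
A typical call, assuming the openai SDK's pydantic usage object (hence the .model_dump() conversion; chat_completion is an illustrative variable name):

print(pretty_usage_table(chat_completion.usage.model_dump()))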

| ------------------ | ----------------- |
| input tokens: 1289 | output tokens: 75 |
| ------------------ | ----------------- |
| uncached: 137      | non-reasoning: 11 |
| cached: 1152       | reasoning: 64     |
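
And to finish the pricing half of your question: yes, you price each bucket separately. A sketch of the arithmetic with placeholder per-million-token rates; look up the current prices for your model, these numbers are illustrative only:

# Price the usage from the original question; the rates are placeholders.
UNCACHED_PER_M = 0.40  # $/1M uncached input tokens (illustrative)
CACHED_PER_M = 0.10    # $/1M cached input tokens (illustrative)
OUTPUT_PER_M = 1.60    # $/1M output tokens (illustrative)

input_tokens, cached_tokens, output_tokens = 9708, 5578, 167
uncached_tokens = input_tokens - cached_tokens  # 4130

cost = (uncached_tokens * UNCACHED_PER_M
        + cached_tokens * CACHED_PER_M
        + output_tokens * OUTPUT_PER_M) / 1_000_000
print(f"${cost:.6f}")  # uncached + cached + output, each priced separately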

You could simply write a new function that adds the extra values to "usage" that might be useful. GPT-5 Thinking in ChatGPT, I discover, is too idiotic and non-functional to follow instructions to do so based on this function as API documentation and the exact input and output needed; o3 is also bad, creating output incompatible with the input. Here's Claude Sonnet, zero-shot, enhancing your usage dict for you:

def enhanced_usage_dict(usage: dict) -> dict:
    """Add calculated fields to OpenAI usage objects from Chat Completions or Responses API.
    
    Accepts usage dict from either API format and returns the same dict with additional
    calculated fields added to the details objects and top level as appropriate.
    
    Args:
        usage: Usage dict extracted from API response (use .model_dump() if using OpenAI SDK)
    
    Returns:
        Enhanced usage dict with additional calculated fields
    
    Raises:
        ValueError: If input is not a dict
    
    Chat Completions API usage example:
    {
       "completion_tokens": 75,
       "prompt_tokens": 1289,
       "total_tokens": 1364,
       "completion_tokens_details": {
          "audio_tokens": 0,
          "reasoning_tokens": 64,
          "accepted_prediction_tokens": 0,
          "rejected_prediction_tokens": 0
       },
       "prompt_tokens_details": {
          "audio_tokens": 0,
          "cached_tokens": 1152
       }
    }
    
    Responses API usage example:
    {
       "input_tokens": 1289,
       "input_tokens_details": {
          "cached_tokens": 0
       },
       "output_tokens": 685,
       "output_tokens_details": {
          "reasoning_tokens": 640
       },
       "total_tokens": 1974
    }
    """
    # Validate input type
    if not isinstance(usage, dict):
        raise ValueError("dict input required; try `response.usage.model_dump()`")
    
    # Copy the dict, including nested detail dicts, so the original isn't mutated
    enhanced = {k: (v.copy() if isinstance(v, dict) else v) for k, v in usage.items()}
    
    # Detect API format and normalize field names
    is_responses_api = "input_tokens" in usage or "output_tokens" in usage
    
    # Get base token counts
    if is_responses_api:
        total_input = enhanced.get("input_tokens", 0)
        total_output = enhanced.get("output_tokens", 0)
        input_details_key = "input_tokens_details"
        output_details_key = "output_tokens_details"
    else:
        total_input = enhanced.get("prompt_tokens", 0)
        total_output = enhanced.get("completion_tokens", 0)
        input_details_key = "prompt_tokens_details"
        output_details_key = "completion_tokens_details"
    
    # Ensure details dicts exist
    if input_details_key not in enhanced:
        enhanced[input_details_key] = {}
    if output_details_key not in enhanced:
        enhanced[output_details_key] = {}
    
    input_details = enhanced[input_details_key]
    output_details = enhanced[output_details_key]
    
    # Calculate and add input/prompt token breakdowns
    cached_tokens = input_details.get("cached_tokens", 0)
    audio_input_tokens = input_details.get("audio_tokens", 0)
    
    input_details["uncached_tokens"] = total_input - cached_tokens
    
    if audio_input_tokens > 0:
        input_details["non_audio_tokens"] = total_input - audio_input_tokens
    
    # Calculate and add output/completion token breakdowns
    reasoning_tokens = output_details.get("reasoning_tokens", 0)
    audio_output_tokens = output_details.get("audio_tokens", 0)
    
    output_details["non_reasoning_tokens"] = total_output - reasoning_tokens
    
    if audio_output_tokens > 0:
        output_details["non_audio_tokens"] = total_output - audio_output_tokens
    
    return enhanced

Then you are provided a dict with additional fields:

{
   "input_tokens": 1289,
   "input_tokens_details": {
      "cached_tokens": 0,
      "uncached_tokens": 1289  # NEW: 1289 - 0
   },
   "output_tokens": 685,
   "output_tokens_details": {
      "reasoning_tokens": 640,
      "non_reasoning_tokens": 45  # NEW: 685 - 640
   },
   "total_tokens": 1974
}
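
If you're on the SDK, the call is just this sketch; model_dump() converts the pydantic usage object to the plain dict the function expects:

import json

usage = enhanced_usage_dict(response.usage.model_dump())
print(json.dumps(usage, indent=3))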