Consistent cache breaks with o4-mini and previous_response_id

When using the new o4-mini we see frequent cache breaks and often only partial cache hits, even when using the new Responses API with previous_response_id (which should guarantee a hit). The cache breaks nearly 75% of the time. Anthropic, in comparison, gives us a near 100% cache hit rate.
In an agentic framework, breaking the cache this often increases costs substantially (it often doubles or triples them).
This is a minimal reproduction script with results below.
Note: we ran this multiple times and the cache hit rate fluctuates a lot; the results below are typical:

import asyncio
import random

import fire
from openai import AsyncOpenAI

client = AsyncOpenAI()  # picks up OPENAI_API_KEY from the environment


def generate_random_text(word_count: int = 1000) -> str:
    words = ["apple", "banana", "cherry", "date", "code", "python", "data", "network"]
    return " ".join(random.choices(words, k=word_count))


def generate_math_question() -> str:
    """Generate a simple math question."""
    a, b = random.randint(1, 100), random.randint(1, 100)
    return f"What is {a} + {b}?"


async def run_caching_demo(iterations: int = 3, model: str = "gpt-4o"):
    """Demonstrate OpenAI's caching with previous_response_id."""
    # Create system prompt with random text for the first iteration
    system_prompt = f"Some random words: {generate_random_text(500)}... You are a math assistant. "
    previous_id = None
    previous_input_tokens = 0

    # Run iterations
    for i in range(1, iterations + 1):
        # Generate random text and math question for each iteration
        random_text = generate_random_text()
        question = generate_math_question()

        # Prepare input based on whether this is the first iteration
        if i == 1:
            input_messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{random_text}\n\n{question}"},
            ]
        else:
            input_messages = [{"role": "user", "content": f"{random_text}\n\n{question}"}]

        # Make request with previous_response_id (None for first iteration)
        response = await client.responses.create(
            model=model, previous_response_id=previous_id, input=input_messages
        )

        # Print caching metrics
        print(f"\nTurn {i}:")
        print(
            f"Cached/Previous input tokens: {response.usage.input_tokens_details.cached_tokens}/{previous_input_tokens}"
        )
        previous_input_tokens = response.usage.input_tokens
        # Update previous_id for next iteration
        previous_id = response.id

        # Brief pause between requests
        if i < iterations:
            await asyncio.sleep(5)


def main(iterations: int = 5, model: str = "o4-mini"):
    """Run simplified OpenAI caching investigation."""
    asyncio.run(run_caching_demo(iterations, model))


if __name__ == "__main__":
    fire.Fire(main)
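Fire exposes the arguments as CLI flags, so you can run it with e.g. (filename assumed): python repro_caching.py --iterations 5 --model o4-mini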

Results:
Normal run with gpt-4o:

Turn 1:
Cached/Previous input tokens: 0/0

Turn 2:
Cached/Previous input tokens: 1518/1532

Turn 3:
Cached/Previous input tokens: 2542/2557

Turn 4:
Cached/Previous input tokens: 3566/3582

Turn 5:
Cached/Previous input tokens: 4590/4608
-> A normal run: gpt-4o reaches an 80-90% cache hit rate (100% expected)

Cache Break with gpt-4o:

Turn 1:
Cached/Previous input tokens: 0/0

Turn 2:
Cached/Previous input tokens: 1518/1532

Turn 3:
Cached/Previous input tokens: 2542/2556

Turn 4:
Cached/Previous input tokens: 0/3580

Turn 5:
Cached/Previous input tokens: 4590/4604
-> A typical example of a random cache break with gpt-4o

Multiple cache breaks and partial caching with o4-mini:

Turn 1:
Cached/Previous input tokens: 0/0

Turn 2:
Cached/Previous input tokens: 0/1531

Turn 3:
Cached/Previous input tokens: 1458/2551

Turn 4:
Cached/Previous input tokens: 0/3572

Turn 5:
Cached/Previous input tokens: 0/4593
-> A typical o4-mini run, with multiple misses

tl;dr: Caching hardly works with o4-mini, even when using previous_response_id. This makes the model roughly as expensive in agent environments as a properly cached Claude 3.7 Sonnet (which costs about 3x as much at baseline).
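As a rough illustration of the cost impact (the token counts are made up and the prices are the o4-mini list prices at the time of writing, so treat this as a sketch only):

# Hypothetical agent turn: ~50k input tokens, ~1k output tokens.
# Assumed o4-mini list prices: $1.10/M input, $0.275/M cached input, $4.40/M output.
INPUT_TOKENS, OUTPUT_TOKENS = 50_000, 1_000
cached_turn = INPUT_TOKENS / 1e6 * 0.275 + OUTPUT_TOKENS / 1e6 * 4.40  # ~$0.018
missed_turn = INPUT_TOKENS / 1e6 * 1.10 + OUTPUT_TOKENS / 1e6 * 4.40   # ~$0.059
print(f"hit: ${cached_turn:.4f}  miss: ${missed_turn:.4f}")
# A full miss makes this turn roughly 3x more expensive.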

Here is something you might find interesting and relevant concerning o4-mini.

The Playground, which does not use a previous response ID, tries to send the reasoning summary back to the model, referenced by a reasoning ID.

This item comes before the assistant response.

If your organization hasn't gone through ID verification, you don't get that summary as output.

If you set store: false, you cannot reuse the previous response ID, and the Playground is non-functional.

If you aren't verified, is the reasoning not stored under a response_id either? Or does it get pruned at random? One wonders.

{
  "model": "o4-mini",
  "input": [
    {
      "role": "developer",
      "content": [
        {
          "type": "input_text",
          "text": "A lovable teddy bear character responds with brief answers meant for children."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "Hi, what's your name?"
        }
      ]
    },
    {
      "type": "reasoning",
      "id": "rs_6819b8d4a9548191828e7cbdf039a98d0bfaa77329dd3dcf",
      "summary": []
    },
    {
      "id": "msg_6819b8d524088191a851cc2d3478801c0bfaa77329dd3dcf",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Hi there! I\u2019m Teddy the friendly bear. What\u2019s your name?"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "Teddy is a funny name for a bear. I'm Joey. What color is your fur?"
        }
      ]
    }
  ],
  "tools": [],
  "text": {
    "format": {
      "type": "text"
    }
  },
  "reasoning": {
    "effort": "low"
  },
  "stream": true,
  "store": false
}

Besides the Playground being broken if you don't consent to having everything about your inputs scarfed up by "store", is there also a built-in cache-breaking mechanism if you don't offer your personal identity and face to be scarfed up by "ID verification"…

How about "get model response" or "list input items": do they merely hide reasoning summaries that are actually stored (the truth obfuscated from you), or are the items genuinely stripped?

One could try variations on the undocumented behavior, but fixing something so clearly undesigned is not my job to investigate. If you have even a slight concern about costs, you'd use Chat Completions and manage chat length yourself, sanely.

This was just a very simple script to show the issue, based on the official example of how to use the new Responses API.
Internally we actually do manual message handling: we stream reasoning summaries and parse the reasoning items back in (roughly the pattern sketched below). The cache still breaks just as often there, which is why I assume this is an OpenAI issue. As a side note, this also happens on Azure with OpenAI models.
Also, we sometimes do get working cache hits with o4-mini, which wouldn't be possible if the reasoning items were not stored and passed back in when previous_response_id is set.
store=true is the default, so that shouldn't be a concern (from the caching perspective).
I actually have the feeling that manually passing the messages back in increases the cache hits, but since the hit rate fluctuates so heavily, it's hard to validate.
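For context, a minimal sketch of that manual handling (simplified; the function name is illustrative and the default store=true is assumed): the output items of each response, including the reasoning items, are appended to a running input list and sent back on the next turn, so the prompt prefix stays identical across turns.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def manual_loop(user_messages: list[str], model: str = "o4-mini"):
    # Running input list: every response's output items (reasoning + assistant
    # message) are appended, so each request repeats the same stable prefix.
    input_items: list = []
    for text in user_messages:
        input_items.append({"role": "user", "content": text})
        response = await client.responses.create(model=model, input=input_items)
        input_items.extend(response.output)
        print("cached tokens:", response.usage.input_tokens_details.cached_tokens)

asyncio.run(manual_loop(["What is 2 + 2?", "And times 3?"]))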

I think it would be reasonable to expect that if OpenAI allows stateful conversation management via previous_response_id (so they already have to retrieve the stored text), they could also retrieve the cache…

Caching is based on input: it matches against prior inputs, not assistant outputs. So if you never send a reasoning item back (or reference one by its reasoning ID), you should be building on an input prefix that has no breaks. Sending reasoning + assistant as the new input should behave much like sending the assistant message only, just with much faster context growth, for dubious benefit beyond continuing to feed the model its own prior reasoning.
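Concretely (the item shapes are illustrative and the reasoning ID is made up), the two next-turn inputs being compared look like this:

# Both variants extend the same prefix, so neither should break the cache;
# B just grows the context faster.
prior_input = [{"role": "user", "content": "What is 2 + 2?"}]
assistant_msg = {"role": "assistant", "content": "4"}
reasoning_item = {"type": "reasoning", "id": "rs_example123", "summary": []}  # hypothetical id
new_user = {"role": "user", "content": "And times 3?"}

input_a = prior_input + [assistant_msg, new_user]                  # A: assistant only
input_b = prior_input + [reasoning_item, assistant_msg, new_user]  # B: reasoning + assistant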

What could cause a cache break is meddling with the lookback, i.e. someone's decision that "we'll only keep the last few reasoning outputs". But that should not behave as all-or-nothing the way your last log shows; eventually the chat would grow enough that there'd be some prefix reuse.


The best design would be for the response ID itself to point at a KV cache, not just at stored messages; the fact that the message list is immutable would facilitate that. However, one could guess that a model's cached state is much larger than the context tokens it was built from, making that infrastructure impractical.

What counts as a hit…

Caching is described as a "best effort to route you to the same inference server", or similar language.

How would that be done, and how at thousands of calls a minute?

Could it be that there is some quick hashing of the input, finding that initial common prefix, for routing purposes, and that a Responses API call carrying only a previous response ID offers no "input" for that layer to inspect? One can only speculate in the dark, but such an architecture for servicing API calls might also explain the low cache hit rates.
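Purely as an illustration of that guess (nothing here is OpenAI's actual architecture; the server pool is made up), prefix-hash routing could look something like this:

import hashlib
import json

SERVERS = ["inference-01", "inference-02", "inference-03"]  # made-up pool

def route(input_items: list[dict], prefix_chars: int = 2048) -> str:
    # Hash the first part of the serialized input so that requests sharing a
    # prompt prefix tend to land on the same machine (and its warm cache).
    prefix = json.dumps(input_items)[:prefix_chars]
    digest = hashlib.sha256(prefix.encode()).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]

A request that carries only previous_response_id has no such prefix to hash, so it would have to be routed some other way.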

A response ID, you'd think, would instead be an easy path straight back to a cache already waiting on an inference server.