When using the new o4-mini, we get frequent cache breaks and often only partial cache hits, even when using the new Responses API with previous_response_id (which should guarantee a hit). The cache breaks nearly 75% of the time; Anthropic, by comparison, has a near 100% cache hit rate.
In an agentic framework, breaking the cache this often increases costs significantly (it often doubles or triples them).
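To illustrate why a cache break is so expensive, here is a rough back-of-the-envelope sketch. The relative prices and token counts below are assumptions for illustration only (cached input tokens are typically billed at a steep discount), not the actual o4-mini rates:

# Relative input-token prices (assumed for illustration; cached input is
# usually billed at a large discount compared to uncached input)
PRICE_UNCACHED = 1.00
PRICE_CACHED = 0.25

def input_cost(input_tokens: int, cached_tokens: int) -> float:
    """Relative input cost of a single turn, given how many tokens hit the cache."""
    uncached = input_tokens - cached_tokens
    return uncached * PRICE_UNCACHED + cached_tokens * PRICE_CACHED

# A 5,000-token agent context that should be fully cached:
print(input_cost(5_000, 5_000))  # 1250.0 with a full cache hit
print(input_cost(5_000, 0))      # 5000.0 on a cache break -> 4x the input cost

Since agent loops resend the whole growing context every turn, a few breaks like this quickly add up.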
Below is a minimal reproduction script, with results following it.
Note: We ran this multiple times and the cache hit rate fluctuates a lot; the results below are typical.
import asyncio
import random

import fire
from openai import AsyncOpenAI

from core.env import env

client = AsyncOpenAI(api_key=env.OPENAI_API_KEY)


def generate_random_text(word_count: int = 1000) -> str:
    words = ["apple", "banana", "cherry", "date", "code", "python", "data", "network"]
    return " ".join(random.choices(words, k=word_count))


def generate_math_question() -> str:
    """Generate a simple math question."""
    a, b = random.randint(1, 100), random.randint(1, 100)
    return f"What is {a} + {b}?"


async def run_caching_demo(iterations: int = 3, model: str = "gpt-4o"):
    """Demonstrate OpenAI's caching with previous_response_id."""
    # Create system prompt with random text for the first iteration
    system_prompt = f"Some random words: {generate_random_text(500)}... You are a math assistant. "
    previous_id = None
    previous_input_tokens = 0

    # Run iterations
    for i in range(1, iterations + 1):
        # Generate random text and a math question for each iteration
        random_text = generate_random_text()
        question = generate_math_question()

        # Prepare input based on whether this is the first iteration
        if i == 1:
            input_messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{random_text}\n\n{question}"},
            ]
        else:
            input_messages = [{"role": "user", "content": f"{random_text}\n\n{question}"}]

        # Make request with previous_response_id (None for first iteration)
        response = await client.responses.create(
            model=model, previous_response_id=previous_id, input=input_messages
        )

        # Print caching metrics
        print(f"\nTurn {i}:")
        print(
            f"Cached/Previous input tokens: {response.usage.input_tokens_details.cached_tokens}/{previous_input_tokens}"
        )
        previous_input_tokens = response.usage.input_tokens

        # Update previous_id for next iteration
        previous_id = response.id

        # Brief pause between requests
        if i < iterations:
            await asyncio.sleep(5)


def main(iterations: int = 5, model: str = "o4-mini"):
    """Run simplified OpenAI caching investigation."""
    asyncio.run(run_caching_demo(iterations, model))


if __name__ == "__main__":
    fire.Fire(main)
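For reference, the script exposes its arguments via fire, so it can be run like this (the filename cache_repro.py is just an example, not part of the script above):

python cache_repro.py                    # defaults: 5 iterations, o4-mini
python cache_repro.py --model=gpt-4o     # compare against gpt-4o
python cache_repro.py --iterations=10    # longer conversation chain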
Results:
Normal run with gpt-4o:
Turn 1:
Cached/Previous input tokens: 0/0
Turn 2:
Cached/Previous input tokens: 1518/1532
Turn 3:
Cached/Previous input tokens: 2542/2557
Turn 4:
Cached/Previous input tokens: 3566/3582
Turn 5:
Cached/Previous input tokens: 4590/4608
-> This is a normal run; across repeated runs, gpt-4o reaches roughly 80-90% cache accuracy (100% would be expected)
Cache Break with gpt-4o:
Turn 1:
Cached/Previous input tokens: 0/0
Turn 2:
Cached/Previous input tokens: 1518/1532
Turn 3:
Cached/Previous input tokens: 2542/2556
Turn 4:
Cached/Previous input tokens: 0/3580
Turn 5:
Cached/Previous input tokens: 4590/4604
-> A typical example of a random cache break with gpt-4o
Multiple cache breaks and partial caching with o4-mini:
Turn 1:
Cached/Previous input tokens: 0/0
Turn 2:
Cached/Previous input tokens: 0/1531
Turn 3:
Cached/Previous input tokens: 1458/2551
Turn 4:
Cached/Previous input tokens: 0/3572
Turn 5:
Cached/Previous input tokens: 0/4593
-> A standard run with multiple cache misses with o4-mini
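One way to summarize that run: taking the numbers printed above for turns 2-5 (turn 1 is excluded since there is nothing to cache yet), only about 12% of the reusable prefix was actually served from the cache:

cached = [0, 1458, 0, 0]              # cached input tokens, turns 2-5 of the o4-mini run above
previous = [1531, 2551, 3572, 4593]   # previous turn's input tokens, turns 2-5

hit_rate = sum(cached) / sum(previous)
print(f"{hit_rate:.0%}")  # ~12%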
tl;dr: Caching hardly works with o4-mini, even when using previous_response_id. This makes the model roughly as expensive in agent environments as a properly cached Sonnet 3.7, which has about 3x the baseline price.