Caching is borked for GPT-5 models

I'm trying to call the GPT-5 models with the Responses API. My system prompt is consistent and longer than 1024 tokens, but I don't hit the cache for any of the GPT-5 model series: `input_tokens_details=InputTokensDetails(cached_tokens=0)`.

When I use the same prompt and the same Responses API call with o3-mini and other older models, I do hit the cache.

I saw some messages about not using the `instructions` param and instead passing the system prompt as part of `input` with role `developer` (for GPT-5), but that doesn't work either:

```python
# GPT_5_NANO is my constant holding the model string, e.g. "gpt-5-nano"
response = await client.responses.create(
    model=GPT_5_NANO,
    # model='o3-mini',
    input=[
        {'type': 'message', 'role': 'developer', 'content': 'static system prompt'},
        {'type': 'message', 'role': 'user', 'content': 'Okay'},
        {'type': 'message', 'role': 'assistant', 'content': 'Its'},
    ],
)
```

I also tried role `'system'`, and tried providing a static `prompt_cache_key` (like the user's email), but no luck with caching on gpt-5-nano.
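For reference, the cache-key variant looked roughly like this (a sketch; the prompt and key values are placeholders):

```python
response = await client.responses.create(
    model=GPT_5_NANO,
    prompt_cache_key='user@example.com',  # static per user
    input=[
        {'type': 'message', 'role': 'system', 'content': 'static system prompt'},
        {'type': 'message', 'role': 'user', 'content': 'Okay'},
    ],
)
print(response.usage.input_tokens_details.cached_tokens)  # stays 0 on gpt-5-nano
```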

Anyone able to get prompt caching working for the GPT-5 series of models?

Round 1: Standard Chat Models

For 3 trials of gpt-4o-2024-08-06 @ 2025-09-23 10:04PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | 57.867 | 69.2 | 50.3 | 69.2 |
| latency (s) | 0.667 | 0.6606 | 0.5961 | 0.7443 |
| total response (s) | 2.903 | 2.4962 | 2.4962 | 3.1217 |
| total rate | 44.558 | 51.278 | 41.003 | 51.278 |
| response tokens | 128.000 | 128 | 128 | 128 |

Cache statistics for gpt-4o-2024-08-06:

Total Trials: 3
Cache Hits (cached_tokens > 0): 3
Cache Misses (cached_tokens == 0): 0

Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 3 |

For 3 trials of gpt-4o-2024-05-13 @ 2025-09-23 10:04PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | 85.600 | 98.7 | 44.2 | 113.9 |
| latency (s) | 0.526 | 0.5138 | 0.4992 | 0.5642 |
| total response (s) | 2.285 | 1.8005 | 1.6145 | 3.44 |
| total rate | 62.527 | 71.091 | 37.209 | 79.282 |
| response tokens | 128.000 | 128 | 128 | 128 |

Cache statistics for gpt-4o-2024-05-13:

Total Trials: 3
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 3

Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 3 |

For 3 trials of gpt-4o-2024-11-20 @ 2025-09-23 10:04PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | 85.100 | 80.0 | 75.0 | 100.3 |
| latency (s) | 0.297 | 0.3676 | 0.2545 | 0.3676 |
| total response (s) | 1.812 | 1.9542 | 1.5338 | 1.9542 |
| total rate | 71.550 | 65.5 | 65.5 | 83.453 |
| response tokens | 128.000 | 128 | 128 | 128 |

Cache statistics for gpt-4o-2024-11-20:

Total Trials: 3
Cache Hits (cached_tokens > 0): 3
Cache Misses (cached_tokens == 0): 0

Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 3 |

For 3 trials of gpt-4o-mini @ 2025-09-23 10:04PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | 48.600 | 46.4 | 46.4 | 52.1 |
| latency (s) | 0.565 | 0.5645 | 0.4955 | 0.6336 |
| total response (s) | 3.185 | 3.3022 | 2.9327 | 3.32 |
| total rate | 40.321 | 38.762 | 38.554 | 43.646 |
| response tokens | 128.000 | 128 | 128 | 128 |

Cache statistics for gpt-4o-mini:

Total Trials: 3
Cache Hits (cached_tokens > 0): 3
Cache Misses (cached_tokens == 0): 0

Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 3 |
  • Identical requests: 100% cache performance across the board, except gpt-4o-2024-05-13, which predates prompt caching support.

Round 2: GPT-5 Models

For 3 trials of gpt-5-nano-2025-08-07 @ 2025-09-23 10:13PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | -121.033 | -147.1 | -166.7 | -49.3 |
| latency (s) | 1.619 | 1.9301 | 1.4589 | 1.9301 |
| total response (s) | 1.630 | 1.9369 | 1.4649 | 1.9369 |
| total rate | 0.000 | 0.0 | 0.0 | 0.0 |
| response tokens | 0.000 | 0 | 0 | 0 |

Cache statistics for gpt-5-nano-2025-08-07:

Total Trials: 3
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 3

Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 3 |

For 3 trials of gpt-5-mini-2025-08-07 @ 2025-09-23 10:13PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | -116.800 | -47.6 | -163.9 | -47.6 |
| latency (s) | 3.146 | 3.5567 | 2.7559 | 3.5567 |
| total response (s) | 3.157 | 3.5777 | 2.762 | 3.5777 |
| total rate | 0.000 | 0.0 | 0.0 | 0.0 |
| response tokens | 0.000 | 0 | 0 | 0 |

Cache statistics for gpt-5-mini-2025-08-07:

Total Trials: 3
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 3

Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 3 |

For 3 trials of gpt-5-2025-08-07 @ 2025-09-23 10:13PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | -55.700 | -76.3 | -76.3 | -41.8 |
| latency (s) | 2.551 | 2.1758 | 2.1758 | 3.0507 |
| total response (s) | 2.570 | 2.1889 | 2.1889 | 3.0746 |
| total rate | 0.000 | 0.0 | 0.0 | 0.0 |
| response tokens | 0.000 | 0 | 0 | 0 |

Cache statistics for gpt-5-2025-08-07:

Total Trials: 3
Cache Hits (cached_tokens > 0): 3
Cache Misses (cached_tokens == 0): 0

Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 3 |

(Note that max_completion_tokens was not long enough to pay for an actual response on the gpt-5 models, since internal reasoning consumed the whole budget; with 0 response tokens, the stream rate figures have gone goofy, reporting on how much output was NOT delivered after waiting.)

Conclusion

gpt-5-mini and gpt-5-nano are not delivering the promised caching service and pricing.

Same as it has been, without resolution.

Method: no special cache-key parameter was passed. The script makes an initial warm-up call per model, sleeps for 5 seconds, then runs the trials one call at a time, round-robin across models.
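The loop, in outline (a sketch with a hypothetical call_model helper that sends one request and returns its usage stats):

```python
import time

def benchmark(client, models, messages, trials=2):
    # Warm-up: one call per model so the shared prompt prefix can be cached.
    for model in models:
        call_model(client, model, messages)

    time.sleep(5)  # give the cache a moment to become available

    # Trials: round-robin across models, one call at a time.
    results = {m: [] for m in models}
    for _ in range(trials):
        for model in models:
            results[model].append(call_model(client, model, messages))
    return results
```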

Trial 1 for gpt-4o-2024-08-06................................................................................................................................. done.

Trial 1 for gpt-4o-2024-05-13................................................................................................................................. done.

Trial 1 for gpt-4o-2024-11-20................................................................................................................................. done.

Trial 1 for gpt-4o-mini................................................................................................................................. done.

Trial 2 for gpt-4o-2024-08-06................................................................................................................................. done.

Trial 2 for gpt-4o-2024-05-13................................................................................................................................. done.

Trial 2 for gpt-4o-2024-11-20................................................................................................................................. done.

Trial 2 for gpt-4o-mini................................................................................................................................. done.
2 Likes

Same here. It seems that I randomly get cache hits, and mostly cache misses, without changing anything. With any model other than gpt-5, the problem does not happen.

Could it be because gpt-5 is like a model router? It uses different models behind the scenes, and that causes a cache miss since the request is not served by the same model?

1 Like

After many hours of figuring out what is going on, I think I found the solution.

I was using an old API version, and for some reason it randomly returned (or didn't return) the cached tokens.

Switching to OPENAI_AZURE_API_VERSION=2025-04-01-preview did the trick
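
For anyone else on Azure, that amounts to pinning the client's API version (a sketch; the endpoint is a placeholder):

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint="https://your-resource.openai.azure.com",  # placeholder
    api_version="2025-04-01-preview",  # the version that fixed caching for me
)
```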

1 Like

I’m calling the OpenAI API directly from OpenAI’s own service. Are you using models hosted in Azure?

I tried the latest model string as well, gpt-5-mini-2025-08-07, but to no avail.

1 Like

You’re actually correct. After further testing with both Azure OpenAI and OpenAI API keys, the caching seems to be random. What worked for me in Azure, although it’s pretty weird: I lowered the usage limit and managed to get almost 99% cache hits after the first message.

This may have to do with the fact that if your request is routed to a different server, the cache will miss.

Finally, testing with gpt-4.1 without changing anything, I get a 100% cache hit after the first message, maybe due to lower usage.

Update: using the Responses API rather than Completions seems to pretty much solve the problem for gpt-5 models.

1 Like

Interesting. I am using the Responses API with the OpenAI API but still see no caching.
Are you using the OpenAI SDK for these?

I’m using gpt-5-mini with the Responses API from the JavaScript SDK. I noticed the same issue:

  1. Long static system prompt (role=developer), 9,000+ tokens
  2. Short varying user prompt (100-200 tokens)
  3. The cache hit rate was 1 in 20. And I couldn’t reproduce that one hit afterward.
    1. That one hit proved that the API is capable of caching.
  4. I tried setting the prompt_cache_key with no effect.

I’m considering switching to Gemini now. Their caching is rock solid. Hit every time.

@OpenAI_Support can we file a bug? This is a consistent issue for many users on every model in the GPT-5 series.

I’m also encountering this issue. No caching at all using gpt-5-nano with the Responses API.

Facing the same issue for various GPT-5 models.
@OpenAI_Support could you please address this?

Thanks for taking the time to flag this; I have raised it with the team.

3 Likes

Hey, I see that it’s still an issue. Is there any update on that?

Caching is broken again for at least 5.1 mini, but seems to work for 5.1. Can this please be prioritized? It’s potentially a costly issue.

I’m facing the same issue with gpt-5-mini, with hit rates between 10-60% depending on the task. It used to be above 90%.

We don't have any known cache issues (from what I found, at least), so to help me repro and investigate further, can you share:

  1. Exact model snapshot and model param (full string)
  2. API path (e.g. .../responses/...)
  3. One full example request (curl/JSON) including prompt_cache_key and user (or “none”), and whether you used streaming or previous_response_id

Can you run a controlled repro: two back-to-back identical requests with the same prompt_cache_key and user, and share both responses + request IDs?


Thank you!
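
For concreteness, a controlled repro of the requested shape might look like this (a sketch; the model snapshot, key, and prompt values are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def call_once() -> None:
    resp = client.responses.create(
        model="gpt-5-mini-2025-08-07",           # placeholder snapshot
        prompt_cache_key="repro-cache-key-001",  # identical on both calls
        user="repro-user-001",                   # identical on both calls
        input=[
            {"role": "developer", "content": "static system prompt (>1024 tokens in a real repro)"},
            {"role": "user", "content": "Okay"},
        ],
    )
    print(resp.id, resp.usage.input_tokens_details.cached_tokens)

call_once()  # first call: expect cached_tokens == 0
call_once()  # second identical call: cached_tokens > 0 if caching works
```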

2 Likes

You have the exact same cache issues.

The “user” field is just for safety now, unless this varies per model instead of being a universal change as documented. That guidance should not be needed.

Cache should be delivered regardless of “prompt_cache_key” in organizations with low or no other usage at the time of the calls, as is documented. It is merely an additional input to the hashing of the input, one that can break cache. That guidance should not be needed unless there is concurrent servicing.


For a script that makes a warmup call to each model in a test list individually and sequentially (a slow process when these models are “reasoning”), with an additional 5-second sleep, I received 0 cache hits across all gpt-5 models in the follow-up single trial used for statistics.

Chat Completions, streaming, no “service_tier”; all model reports look like this:

For 1 trials of gpt-5.2-2025-12-11 @ 2026-01-05 01:50PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 1.1238 | 1.1238 | 1.1238 | 1.1238 |
| total response (s) | 3.5793 | 3.5793 | 3.5793 | 3.5793 |
| response tokens | 128 | 128 | 128 | 128 |
| total rate (tok/s) | 35.761 | 35.761 | 35.761 | 35.761 |
| stream rate (tok/s) | 51.7 | 51.7 | 51.7 | 51.7 |

Cache statistics for gpt-5.2-2025-12-11:

Total Trials: 1
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 1
Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 0.0
Avg Cache Coverage: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 1 |

The last part of each report is a histogram; here there is only 1 warmup and 1 trial per model.

So after one run, there are TWO identical calls that were capable of populating the cache. This is me running my testing script again after a minute: another warmup call, making three cache-able calls before the one finally tested. Result: no cache for the same mini and nano models reported in September, and no cache for gpt-5.2. Only gpt-5 and gpt-5.1 delivered 2432/2490 cached tokens, and then only at call #4.

Log

(note, there’s a dot printed for every SSE output event, not displayed by forum software)

Warmup for gpt-5-nano-2025-08-07 done.

Warmup for gpt-5-mini-2025-08-07… done.

Warmup for gpt-5-2025-08-07 done.

Warmup for gpt-5.1-2025-11-13… done.

Warmup for gpt-5.2-2025-12-11… done.

Waiting for cache to be available…
Proceeding to trials.

Trial 1 for gpt-5-nano-2025-08-07 done.

Trial 1 for gpt-5-mini-2025-08-07… done.

Trial 1 for gpt-5-2025-08-07… done.

Trial 1 for gpt-5.1-2025-11-13… done.

Trial 1 for gpt-5.2-2025-12-11… done.

For 1 trials of gpt-5-nano-2025-08-07 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 9.2408 | 9.2408 | 9.2408 | 9.2408 |
| total response (s) | 9.2408 | 9.2408 | 9.2408 | 9.2408 |
| response tokens | 1024 | 1024 | 1024 | 1024 |
| total rate (tok/s) | 110.813 | 110.813 | 110.813 | 110.813 |
| stream rate (tok/s) | 1023.0 | 1023.0 | 1023.0 | 1023.0 |

Cache statistics for gpt-5-nano-2025-08-07:

Total Trials: 1
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 1
Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 0.0
Avg Cache Coverage: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 1 |

For 1 trials of gpt-5-mini-2025-08-07 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 12.3435 | 12.3435 | 12.3435 | 12.3435 |
| total response (s) | 17.1476 | 17.1476 | 17.1476 | 17.1476 |
| response tokens | 1024 | 1024 | 1024 | 1024 |
| total rate (tok/s) | 59.717 | 59.717 | 59.717 | 59.717 |
| stream rate (tok/s) | 212.9 | 212.9 | 212.9 | 212.9 |

Cache statistics for gpt-5-mini-2025-08-07:

Total Trials: 1
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 1
Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 0.0
Avg Cache Coverage: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 1 |

For 1 trials of gpt-5-2025-08-07 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 20.1620 | 20.1620 | 20.1620 | 20.1620 |
| total response (s) | 20.7592 | 20.7592 | 20.7592 | 20.7592 |
| response tokens | 1024 | 1024 | 1024 | 1024 |
| total rate (tok/s) | 49.328 | 49.328 | 49.328 | 49.328 |
| stream rate (tok/s) | 1713.0 | 1713.0 | 1713.0 | 1713.0 |

Cache statistics for gpt-5-2025-08-07:

Total Trials: 1
Cache Hits (cached_tokens > 0): 1
Cache Misses (cached_tokens == 0): 0
Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 2432.0
Avg Cache Coverage: 97.67%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 1 |

For 1 trials of gpt-5.1-2025-11-13 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 0.5681 | 0.5681 | 0.5681 | 0.5681 |
| total response (s) | 7.7394 | 7.7394 | 7.7394 | 7.7394 |
| response tokens | 410 | 410 | 410 | 410 |
| total rate (tok/s) | 52.976 | 52.976 | 52.976 | 52.976 |
| stream rate (tok/s) | 57.0 | 57.0 | 57.0 | 57.0 |

Cache statistics for gpt-5.1-2025-11-13:

Total Trials: 1
Cache Hits (cached_tokens > 0): 1
Cache Misses (cached_tokens == 0): 0
Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 2432.0
Avg Cache Coverage: 97.67%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 1 |

For 1 trials of gpt-5.2-2025-12-11 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 0.4638 | 0.4638 | 0.4638 | 0.4638 |
| total response (s) | 7.2426 | 7.2426 | 7.2426 | 7.2426 |
| response tokens | 308 | 308 | 308 | 308 |
| total rate (tok/s) | 42.526 | 42.526 | 42.526 | 42.526 |
| stream rate (tok/s) | 45.3 | 45.3 | 45.3 | 45.3 |

Cache statistics for gpt-5.2-2025-12-11:

Total Trials: 1
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 1
Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 0.0
Avg Cache Coverage: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 1 |

Replication code

…enjoy more free work, ironically, while the overbilling of API clients continues.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Iterable
import time

import openai  # requires: pip install openai


# -----------------------------
# Test parameters
# -----------------------------
MODELS = [
    "gpt-5-nano-2025-08-07",
    "gpt-5-mini-2025-08-07",
    "gpt-5-2025-08-07",
    "gpt-5.1-2025-11-13",
    "gpt-5.2-2025-12-11",
]

TRIALS = 1
MAX_COMPLETION_TOKENS = 1024  # keep small; we care about prompt caching
SLEEP_AFTER_WARMUP_S = 5

DISPLAY_CACHE = True
DISPLAY_BENCHMARK = True


PROMPT = "Write another poem, about a puppy, notably different."

HISTORY_TEXT = """
Certainly! Here are ten different poems about kittens, each with six stanzas:

---

**1. The Playful Kitten**

In the morning light, a kitten plays,  
Chasing shadows in a sunlit haze.  
With tiny paws and a curious gaze,  
It leaps and bounds in joyful ways.

A ball of yarn becomes its prey,  
Rolling, tumbling, throughout the day.  
With every pounce, it seems to say,  
"Come join me in my playful fray."

The world is vast, a grand expanse,  
For a kitten's heart, a place to dance.  
In every corner, a new chance,  
To explore, to leap, to prance.

With whiskers twitching, ears alert,  
It scampers 'cross the grassy dirt.  
A gentle breeze lifts its furred shirt,  
As it chases leaves with a joyful spurt.

When the sun dips low, the kitten rests,  
Curled in a ball, its cozy nest.  
Dreams of adventures fill its chest,  
In slumber, it finds its peaceful best.

The night is calm, the stars aglow,  
The kitten sleeps, its breath is slow.  
In dreams, it dances to and fro,  
A playful spirit, forever aglow.

---

**2. The Curious Explorer**

A kitten peeks through the garden gate,  
With eyes so wide, it cannot wait.  
To see the world, to contemplate,  
The wonders that lie beyond its fate.

A butterfly flutters by its nose,  
The kitten leaps, its excitement grows.  
In the garden, where the wild rose,  
Blooms in colors that the kitten knows.

The grass is soft beneath its feet,  
A world of scents, a world to greet.  
With every step, a new heartbeat,  
In the garden, where adventures meet.

A rustle in the leaves, a sound,  
The kitten pauses, looks around.  
In the silence, it is spellbound,  
By the mysteries that abound.

The sun sets low, the shadows creep,  
The kitten yawns, its eyes half-sleep.  
In the garden, where secrets keep,  
It finds a place to rest and weep.

The stars come out, the moon is bright,  
The kitten dreams in the soft moonlight.  
In its heart, a gentle delight,  
For the world it explored with all its might.

---

**3. The Gentle Companion**

A kitten sits by the window pane,  
Watching the world through the falling rain.  
With eyes so soft, it feels no pain,  
In its heart, a gentle refrain.

A friend to all, with a tender purr,  
It comforts those who come to confer.  
In its presence, hearts never stir,  
For the kitten's love is a gentle blur.

With a nuzzle and a soft embrace,  
It brings a smile to every face.  
In its warmth, there's a special grace,  
A gentle love that leaves no trace.

The world outside may be cold and gray,  
But the kitten's heart is warm and gay.  
In its presence, worries fade away,  
For its love is here to stay.

As the night falls, the kitten sleeps,  
In dreams, its gentle spirit leaps.  
In its heart, a love that keeps,  
A gentle soul that never weeps.

The dawn breaks, the world awakes,  
The kitten stirs, its heart still aches.  
For the love it gives, the love it takes,  
A gentle companion, for all our sakes.

---

**4. The Mischievous Kitten**

A kitten with a twinkle in its eye,  
Plots its next adventure, oh so sly.  
With a flick of its tail, it leaps high,  
Into the world where mischief lies.

A vase on the table, a tempting sight,  
The kitten ponders, its eyes alight.  
With a gentle nudge, it takes flight,  
And crashes down in the quiet night.

A curtain sways, a perfect climb,  
The kitten scales it in no time.  
With a playful meow, a joyful chime,  
It swings and sways, a rhythmic rhyme.

A shoe becomes a hiding place,  
The kitten peeks out with a cheeky face.  
In its eyes, a mischievous trace,  
As it darts away in a playful race.

The day is done, the kitten rests,  
In dreams, it plans its next conquests.  
With a purr, it snuggles in its nest,  
A mischievous heart, forever blessed.

The night is calm, the stars are bright,  
The kitten dreams of its playful plight.  
In its heart, a mischievous light,  
A playful spirit, taking flight.

---

**5. The Adventurous Spirit**

A kitten with a heart so bold,  
Ventures forth into the world untold.  
With every step, a story unfolds,  
In the heart of a kitten, brave and bold.

The forest calls, a whispering breeze,  
The kitten answers with gentle ease.  
In the shadows, where the wild trees,  
Stand tall and proud, the kitten sees.

A river flows, a gentle stream,  
The kitten leaps, its eyes agleam.  
In the water, where the fish gleam,  
It finds a world of endless dreams.

The mountains rise, a daunting sight,  
The kitten climbs with all its might.  
In the heights, where the eagles take flight,  
It finds a world of pure delight.

The sun sets low, the world is still,  
The kitten rests on a quiet hill.  
In its heart, a gentle thrill,  
For the adventures that it will fulfill.

The stars come out, the night is clear,  
The kitten dreams without a fear.  
In its heart, a world so dear,  
An adventurous spirit, forever near.

---

**6. The Dreamer**

A kitten dreams beneath the stars,  
Of distant lands and worlds afar.  
In its heart, a gentle spar,  
A dreamer with a wishful star.

The moonlight dances on its fur,  
A gentle breeze begins to stir.  
In its dreams, the world is a blur,  
A place where dreams and wishes occur.

A castle stands on a distant hill,  
The kitten dreams of a world so still.  
In its heart, a gentle thrill,  
For the dreams that it will fulfill.

A ship sails on a moonlit sea,  
The kitten dreams of being free.  
In its heart, a gentle plea,  
For the dreams that it will see.

The stars twinkle in the night,  
The kitten dreams of a world so bright.  
In its heart, a gentle light,  
A dreamer with a wishful sight.

The dawn breaks, the world awakes,  
The kitten stirs, its heart still aches.  
For the dreams it dreams, the dreams it makes,  
A dreamer with a heart that never breaks.

---

**7. The Gentle Healer**

A kitten with a gentle touch,  
Brings comfort to those who need it much.  
With a purr and a gentle clutch,  
It heals the heart with a gentle hush.

A friend to all, with a tender heart,  
It brings warmth to those who fall apart.  
In its presence, a gentle start,  
To heal the wounds that tear apart.

With a nuzzle and a gentle purr,  
It brings peace to those who confer.  
In its warmth, a gentle blur,  
A healing touch that leaves no stir.

The world may be harsh and cold,  
But the kitten's heart is warm and bold.  
In its presence, worries unfold,  
For its love is a story untold.

As the night falls, the kitten sleeps,  
In dreams, its gentle spirit leaps.  
In its heart, a love that keeps,  
A gentle healer, for all our sakes.

The dawn breaks, the world awakes,  
The kitten stirs, its heart still aches.  
For the love it gives, the love it takes,  
A gentle healer, for all our sakes.

---

**8. The Playful Spirit**

A kitten with a playful heart,  
Leaps and bounds, a joyful start.  
In the garden, where the flowers part,  
It finds a world of playful art.

A butterfly flutters by its nose,  
The kitten leaps, its excitement grows.  
In the garden, where the wild rose,  
Blooms in colors that the kitten knows.

The grass is soft beneath its feet,  
A world of scents, a world to greet.  
With every step, a new heartbeat,  
In the garden, where adventures meet.

A rustle in the leaves, a sound,  
The kitten pauses, looks around.  
In the silence, it is spellbound,  
By the mysteries that abound.

The sun sets low, the shadows creep,  
The kitten yawns, its eyes half-sleep.  
In the garden, where secrets keep,  
It finds a place to rest and weep.

The stars come out, the moon is bright,  
The kitten dreams in the soft moonlight.  
In its heart, a gentle delight,  
For the world it explored with all its might.

---

**9. The Gentle Guardian**

A kitten with a watchful eye,  
Guards the home where loved ones lie.  
With a gentle purr and a soft sigh,  
It watches over with a loving eye.

A friend to all, with a tender heart,  
It brings warmth to those who fall apart.  
In its presence, a gentle start,  
To guard the hearts that tear apart.

With a nuzzle and a gentle purr,  
It brings peace to those who confer.  
In its warmth, a gentle blur,  
A guardian's touch that leaves no stir.

The world may be harsh and cold,  
But the kitten's heart is warm and bold.  
In its presence, worries unfold,  
For its love is a story untold.

As the night falls, the kitten sleeps,  
In dreams, its gentle spirit leaps.  
In its heart, a love that keeps,  
A gentle guardian, for all our sakes.

The dawn breaks, the world awakes,  
The kitten stirs, its heart still aches.  
For the love it gives, the love it takes,  
A gentle guardian, for all our sakes.

---

**10. The Joyful Kitten**

A kitten with a joyful heart,  
Leaps and bounds, a joyful start.  
In the garden, where the flowers part,  
It finds a world of joyful art.

A butterfly flutters by its nose,  
The kitten leaps, its excitement grows.  
In the garden, where the wild rose,  
Blooms in colors that the kitten knows.

The grass is soft beneath its feet,  
A world of scents, a world to greet.  
With every step, a new heartbeat,  
In the garden, where adventures meet.

A rustle in the leaves, a sound,  
The kitten pauses, looks around.  
In the silence, it is spellbound,  
By the mysteries that abound.

The sun sets low, the shadows creep,  
The kitten yawns, its eyes half-sleep.  
In the garden, where secrets keep,  
It finds a place to rest and weep.

The stars come out, the moon is bright,  
The kitten dreams in the soft moonlight.  
In its heart, a gentle delight,  
For the world it explored with all its might.

---

Feel free to use these poems!
""".strip()

HISTORY: list[dict[str, Any]] = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": (
                    "You are a creative author. "
                    "You are powered by an AI model with a 128k context window for responses."
                ),
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Write ten different poems about kittens, each with six stanzas.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": HISTORY_TEXT,
            }
        ],
    },
]


# -----------------------------
# Benchmark plumbing
# -----------------------------
@dataclass(frozen=True, slots=True)
class CallMetrics:
    latency_s: float
    total_s: float
    prompt_tokens: int
    completion_tokens: int
    cached_prompt_tokens: int

    @property
    def total_rate(self) -> float:
        if self.total_s <= 0:
            return 0.0
        return self.completion_tokens / self.total_s

    @property
    def stream_rate(self) -> float:
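        # Exclude the first token (its wait is counted as latency) and clamp
        # both terms at zero, so a response with no visible tokens cannot
        # produce a negative rate the way the September figures did.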
        streamed_tokens = max(self.completion_tokens - 1, 0)
        stream_time = max(self.total_s - self.latency_s, 0.0)
        return streamed_tokens / (stream_time if stream_time > 0 else 1.0)


def now_str() -> str:
    return time.strftime("%Y-%m-%d %I:%M%p", time.localtime(time.time()))


def safe_int(value: Any) -> int:
    try:
        return int(value)
    except Exception:
        return 0


def extract_cached_tokens(usage: Any) -> int:
    details = getattr(usage, "prompt_tokens_details", None)
    if details is None:
        return 0
    return safe_int(getattr(details, "cached_tokens", 0))


def run_streamed_call(
    *,
    client: openai.Client,
    model: str,
    messages: list[dict[str, Any]],
    max_completion_tokens: int,
    show_progress: bool,
) -> CallMetrics:
    start = time.perf_counter()
    first_content_at: float | None = None
    usage_obj: Any | None = None

    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        max_completion_tokens=max_completion_tokens,
        stream_options={"include_usage": True},
    )

    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage_obj = chunk.usage

        choices = getattr(chunk, "choices", None)
        if not choices:
            continue

        delta = getattr(choices[0], "delta", None)
        if delta is None:
            continue

        content = getattr(delta, "content", None)
        if not content:
            continue

        if first_content_at is None:
            first_content_at = time.perf_counter()

        if show_progress:
            print(".", end="", flush=True)

    end = time.perf_counter()

    latency_s = (first_content_at - start) if first_content_at is not None else (end - start)
    total_s = end - start

    prompt_tokens = safe_int(getattr(usage_obj, "prompt_tokens", 0)) if usage_obj else 0
    completion_tokens = safe_int(getattr(usage_obj, "completion_tokens", 0)) if usage_obj else 0
    cached_prompt_tokens = extract_cached_tokens(usage_obj) if usage_obj else 0

    return CallMetrics(
        latency_s=round(latency_s, 4),
        total_s=round(total_s, 4),
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        cached_prompt_tokens=cached_prompt_tokens,
    )


def run_models_phase(
    *,
    client: openai.Client,
    models: Iterable[str],
    messages: list[dict[str, Any]],
    max_completion_tokens: int,
    label: str,
    show_progress: bool,
) -> dict[str, CallMetrics | None]:
    results: dict[str, CallMetrics | None] = {}
    for model in models:
        print(f"\n{label} for {model}", end="", flush=True)
        try:
            metrics = run_streamed_call(
                client=client,
                model=model,
                messages=messages,
                max_completion_tokens=max_completion_tokens,
                show_progress=show_progress,
            )
            print(" done.")
            results[model] = metrics
        except Exception as e:
            print(f" error: {e}")
            results[model] = None
    return results


def stats_row(values: list[float]) -> tuple[float, float, float, float] | None:
    if not values:
        return None
    first = values[0]
    return (sum(values) / len(values), first, min(values), max(values))


def print_reports(results: dict[str, list[CallMetrics]]) -> None:
    stamp = now_str()

    for model, calls in results.items():
        if DISPLAY_BENCHMARK:
            print(f"### For {len(calls)} trials of {model} @ {stamp}:")
            print("| Stat | Average | First | Minimum | Maximum |")
            print("| --- | ---: | ---: | ---: | ---: |")

            latency = [c.latency_s for c in calls]
            total = [c.total_s for c in calls]
            completion_tokens = [float(c.completion_tokens) for c in calls]
            total_rate = [c.total_rate for c in calls]
            stream_rate = [c.stream_rate for c in calls]

            for name, values, fmt in [
                ("latency (s)", latency, "{:.4f}"),
                ("total response (s)", total, "{:.4f}"),
                ("response tokens", completion_tokens, "{:.0f}"),
                ("total rate (tok/s)", total_rate, "{:.3f}"),
                ("stream rate (tok/s)", stream_rate, "{:.1f}"),
            ]:
                row = stats_row(values)
                if row is None:
                    print(f"| {name} | N/A | N/A | N/A | N/A |")
                    continue
                avg, first, mn, mx = row
                print(
                    f"| {name} | {fmt.format(avg)} | {fmt.format(first)} | {fmt.format(mn)} | {fmt.format(mx)} |"
                )
            print()

        if DISPLAY_CACHE:
            cached = [c.cached_prompt_tokens for c in calls]
            prompt = [c.prompt_tokens for c in calls]

            cache_hits = sum(1 for t in cached if t > 0)
            cache_misses = sum(1 for t in cached if t == 0)
            total_trials = len(cached)

            avg_prompt = (sum(prompt) / len(prompt)) if prompt else 0.0
            avg_cached = (sum(cached) / len(cached)) if cached else 0.0
            avg_cov = (100.0 * avg_cached / avg_prompt) if avg_prompt > 0 else 0.0

            print(f"### Cache statistics for {model}:")
            print(f"Total Trials: {total_trials}")
            print(f"Cache Hits (cached_tokens > 0): {cache_hits}")
            print(f"Cache Misses (cached_tokens == 0): {cache_misses}")
            if total_trials > 0:
                print(f"Cache Hit Rate: {(cache_hits / total_trials) * 100:.2f}%")
                print(f"Cache Miss Rate: {(cache_misses / total_trials) * 100:.2f}%")
            print(f"Avg Prompt Tokens: {avg_prompt:.1f}")
            print(f"Avg Cached Tokens: {avg_cached:.1f}")
            print(f"Avg Cache Coverage: {avg_cov:.2f}%")
            print()

            token_counts: dict[int, int] = {}
            for t in cached:
                token_counts[t] = token_counts.get(t, 0) + 1

            print("#### Cached Tokens Counts:")
            print("| Cached Tokens Value | Count |")
            print("| ---: | ---: |")
            for tokens_value in sorted(token_counts.keys()):
                print(f"| {tokens_value} | {token_counts[tokens_value]} |")
            print()


def main() -> None:
    client = openai.Client(timeout=120, max_retries=0)  # Uses OPENAI_API_KEY env var

    messages = HISTORY + [{"role": "user", "content": PROMPT}]

    run_models_phase(
        client=client,
        models=MODELS,
        messages=messages,
        max_completion_tokens=MAX_COMPLETION_TOKENS,
        label="Warmup",
        show_progress=True,
    )

    if SLEEP_AFTER_WARMUP_S > 0:
        print("\nWaiting for cache to be available...")
        time.sleep(SLEEP_AFTER_WARMUP_S)
        print("Proceeding to trials.\n")

    results: dict[str, list[CallMetrics]] = {m: [] for m in MODELS}
    for i in range(TRIALS):
        phase = run_models_phase(
            client=client,
            models=MODELS,
            messages=messages,
            max_completion_tokens=MAX_COMPLETION_TOKENS,
            label=f"Trial {i + 1}",
            show_progress=True,
        )
        for model, metrics in phase.items():
            if metrics is not None:
                results[model].append(metrics)

    print()
    print_reports(results)


if __name__ == "__main__":
    main()
```
2 Likes

I’ll add my input by linking two more recent independent reports on the same issue. Hope this helps pinpoint the root cause!

4 Likes

Hi @vb, my caching issue was a prompt-engineering bug, so no issues on my side. I will report in the thread.

1 Like