Caching is borked for GPT-5 models

I'm trying to call the GPT-5 models with the Responses API. My system prompt is consistent and longer than 1024 tokens, but I don't hit the cache for any of the GPT-5 model series: `input_tokens_details=InputTokensDetails(cached_tokens=0)`.

When I use the same prompt and the same Responses API call with o3-mini and other older models, I do hit the cache.

I saw some messages about not using the `instructions` param and instead passing the system prompt as part of `input` with role `developer` (for GPT-5), but that doesn't work either:

```python
# GPT_5_NANO is my constant holding the model string, e.g. "gpt-5-nano"
response = await client.responses.create(
    model=GPT_5_NANO,
    # model='o3-mini',
    input=[
        {'type': 'message', 'role': 'developer', 'content': 'static system prompt'},
        {'type': 'message', 'role': 'user', 'content': 'Okay'},
        {'type': 'message', 'role': 'assistant', 'content': 'Its'},
    ],
)
```

I also tried role `'system'`, and tried providing a static `prompt_cache_key` (like the user's email), but no luck with caching on gpt-5-nano.
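For reference, the cache-key variant looked roughly like this (a sketch; the prompt and key values are placeholders):

```python
response = await client.responses.create(
    model=GPT_5_NANO,
    prompt_cache_key='user@example.com',  # static per user
    input=[
        {'type': 'message', 'role': 'system', 'content': 'static system prompt'},
        {'type': 'message', 'role': 'user', 'content': 'Okay'},
    ],
)
print(response.usage.input_tokens_details.cached_tokens)  # stays 0 on gpt-5-nano
```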

Anyone able to get prompt caching working for the GPT-5 series of models?

Round 1: Standard Chat Models

For 3 trials of gpt-4o-2024-08-06 @ 2025-09-23 10:04PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | 57.867 | 69.2 | 50.3 | 69.2 |
| latency (s) | 0.667 | 0.6606 | 0.5961 | 0.7443 |
| total response (s) | 2.903 | 2.4962 | 2.4962 | 3.1217 |
| total rate | 44.558 | 51.278 | 41.003 | 51.278 |
| response tokens | 128.000 | 128 | 128 | 128 |

Cache statistics for gpt-4o-2024-08-06:

Total Trials: 3
Cache Hits (cached_tokens > 0): 3
Cache Misses (cached_tokens == 0): 0

Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 3 |

For 3 trials of gpt-4o-2024-05-13 @ 2025-09-23 10:04PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | 85.600 | 98.7 | 44.2 | 113.9 |
| latency (s) | 0.526 | 0.5138 | 0.4992 | 0.5642 |
| total response (s) | 2.285 | 1.8005 | 1.6145 | 3.44 |
| total rate | 62.527 | 71.091 | 37.209 | 79.282 |
| response tokens | 128.000 | 128 | 128 | 128 |

Cache statistics for gpt-4o-2024-05-13:

Total Trials: 3
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 3

Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 3 |

For 3 trials of gpt-4o-2024-11-20 @ 2025-09-23 10:04PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | 85.100 | 80.0 | 75.0 | 100.3 |
| latency (s) | 0.297 | 0.3676 | 0.2545 | 0.3676 |
| total response (s) | 1.812 | 1.9542 | 1.5338 | 1.9542 |
| total rate | 71.550 | 65.5 | 65.5 | 83.453 |
| response tokens | 128.000 | 128 | 128 | 128 |

Cache statistics for gpt-4o-2024-11-20:

Total Trials: 3
Cache Hits (cached_tokens > 0): 3
Cache Misses (cached_tokens == 0): 0

Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 3 |

For 3 trials of gpt-4o-mini @ 2025-09-23 10:04PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | 48.600 | 46.4 | 46.4 | 52.1 |
| latency (s) | 0.565 | 0.5645 | 0.4955 | 0.6336 |
| total response (s) | 3.185 | 3.3022 | 2.9327 | 3.32 |
| total rate | 40.321 | 38.762 | 38.554 | 43.646 |
| response tokens | 128.000 | 128 | 128 | 128 |

Cache statistics for gpt-4o-mini:

Total Trials: 3
Cache Hits (cached_tokens > 0): 3
Cache Misses (cached_tokens == 0): 0

Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 3 |
  • Identical requests: 100% cache performance across the board, except gpt-4o-2024-05-13, which predates prompt caching support.

Round 2: GPT-5 Models

For 3 trials of gpt-5-nano-2025-08-07 @ 2025-09-23 10:13PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | -121.033 | -147.1 | -166.7 | -49.3 |
| latency (s) | 1.619 | 1.9301 | 1.4589 | 1.9301 |
| total response (s) | 1.630 | 1.9369 | 1.4649 | 1.9369 |
| total rate | 0.000 | 0.0 | 0.0 | 0.0 |
| response tokens | 0.000 | 0 | 0 | 0 |

Cache statistics for gpt-5-nano-2025-08-07:

Total Trials: 3
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 3

Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 3 |

For 3 trials of gpt-5-mini-2025-08-07 @ 2025-09-23 10:13PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | -116.800 | -47.6 | -163.9 | -47.6 |
| latency (s) | 3.146 | 3.5567 | 2.7559 | 3.5567 |
| total response (s) | 3.157 | 3.5777 | 2.762 | 3.5777 |
| total rate | 0.000 | 0.0 | 0.0 | 0.0 |
| response tokens | 0.000 | 0 | 0 | 0 |

Cache statistics for gpt-5-mini-2025-08-07:

Total Trials: 3
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 3

Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 3 |

For 3 trials of gpt-5-2025-08-07 @ 2025-09-23 10:13PM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| stream rate | -55.700 | -76.3 | -76.3 | -41.8 |
| latency (s) | 2.551 | 2.1758 | 2.1758 | 3.0507 |
| total response (s) | 2.570 | 2.1889 | 2.1889 | 3.0746 |
| total rate | 0.000 | 0.0 | 0.0 | 0.0 |
| response tokens | 0.000 | 0 | 0 | 0 |

Cache statistics for gpt-5-2025-08-07:

Total Trials: 3
Cache Hits (cached_tokens > 0): 3
Cache Misses (cached_tokens == 0): 0

Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 3 |

(Note that max_completion_tokens was not long enough to pay for an actual response on the gpt-5 models, since internal reasoning consumed the whole budget; with 0 response tokens, the stream rate figures have gone goofy, reporting on how much output was NOT delivered after waiting.)

Conclusion

gpt-5-mini and gpt-5-nano are not delivering the promised caching service and pricing.

Same as it has been, without resolution.

Method: no special cache-key parameter was passed. The script makes an initial warm-up call per model, sleeps for 5 seconds, then runs the trials one call at a time, round-robin across models.
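The loop, in outline (a sketch with a hypothetical call_model helper that sends one request and returns its usage stats):

```python
import time

def benchmark(client, models, messages, trials=2):
    # Warm-up: one call per model so the shared prompt prefix can be cached.
    for model in models:
        call_model(client, model, messages)

    time.sleep(5)  # give the cache a moment to become available

    # Trials: round-robin across models, one call at a time.
    results = {m: [] for m in models}
    for _ in range(trials):
        for model in models:
            results[model].append(call_model(client, model, messages))
    return results
```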

Trial 1 for gpt-4o-2024-08-06................................................................................................................................. done.

Trial 1 for gpt-4o-2024-05-13................................................................................................................................. done.

Trial 1 for gpt-4o-2024-11-20................................................................................................................................. done.

Trial 1 for gpt-4o-mini................................................................................................................................. done.

Trial 2 for gpt-4o-2024-08-06................................................................................................................................. done.

Trial 2 for gpt-4o-2024-05-13................................................................................................................................. done.

Trial 2 for gpt-4o-2024-11-20................................................................................................................................. done.

Trial 2 for gpt-4o-mini................................................................................................................................. done.
2 Likes

Same here. It seems that I randomly get cache hits, and mostly cache misses, without changing anything. With any model other than gpt-5, the problem does not happen.

Could it be because gpt-5 is like a model router? It uses different models behind the scenes, and that causes a cache miss since the request is not served by the same model?

1 Like

After many hours of figuring out what is going on, I think I found the solution.

I was using an old API version, and for some reason it randomly returned (or didn't return) the cached tokens.

Switching to OPENAI_AZURE_API_VERSION=2025-04-01-preview did the trick
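
For anyone else on Azure, that amounts to pinning the client's API version (a sketch; the endpoint is a placeholder):

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint="https://your-resource.openai.azure.com",  # placeholder
    api_version="2025-04-01-preview",  # the version that fixed caching for me
)
```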

1 Like

I’m calling the OpenAI API directly from OpenAI’s own service. Are you using models hosted in Azure?

I tried the latest model string as well, gpt-5-mini-2025-08-07, but to no avail.

1 Like

You’re actually correct. After further testing with both Azure OpenAI and OpenAI API keys, the caching seems to be random. What worked for me in Azure, although it’s pretty weird: I lowered the usage limit and managed to get almost 99% cache hits after the first message.

This may have to do with the fact that if your request is routed to a different server, the cache will miss.

Finally, testing with gpt-4.1 without changing anything, I get a 100% cache hit after the first message, maybe due to lower usage.

Update: using the Responses API rather than Completions seems to pretty much solve the problem for gpt-5 models.

1 Like

Interesting. I am using the Responses API with the OpenAI API but still see no caching.
Are you using the OpenAI SDK for these?

I’m using gpt-5-mini with the Responses API from the JavaScript SDK. I noticed the same issue:

  1. Long static system prompt (role=developer), 9,000+ tokens
  2. Short varying user prompt (100-200 tokens)
  3. The cache hit rate was 1 in 20. And I couldn’t reproduce that one hit afterward.
    1. That one hit proved that the API is capable of caching.
  4. I tried setting the prompt_cache_key with no effect.

I’m considering switching to Gemini now. Their caching is rock solid. Hit every time.

@OpenAI_Support can we file a bug? This is a consistent issue for many users on every model in the GPT-5 series.

I’m also encountering this issue. No caching at all using gpt-5-nano with the Responses API.

Facing the same issue for various GPT-5 models.
@OpenAI_Support could you please address this?

Thanks for taking the time to flag this; I have raised it with the team.

3 Likes

Hey, I see that it’s still an issue. Is there any update on that?

Caching is broken again for at least 5.1 mini, but seems to work for 5.1. Can this please be prioritized? It’s potentially a costly issue.

I’m facing the same issue with gpt-5-mini, with hit rates between 10-60% depending on the task. It used to be above 90%.

We don't have any known cache issues (from what I found, at least), so to help me repro and investigate further, can you share:

  1. Exact model snapshot and model param (full string)
  2. API path (e.g. .../responses/...)
  3. One full example request (curl/JSON) including prompt_cache_key and user (or “none”), and whether you used streaming or previous_response_id

Can you run a controlled repro: two back-to-back identical requests with the same prompt_cache_key and user, and share both responses + request IDs?


Thank you!
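
For concreteness, a controlled repro of the requested shape might look like this (a sketch; the model snapshot, key, and prompt values are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def call_once() -> None:
    resp = client.responses.create(
        model="gpt-5-mini-2025-08-07",           # placeholder snapshot
        prompt_cache_key="repro-cache-key-001",  # identical on both calls
        user="repro-user-001",                   # identical on both calls
        input=[
            {"role": "developer", "content": "static system prompt (>1024 tokens in a real repro)"},
            {"role": "user", "content": "Okay"},
        ],
    )
    print(resp.id, resp.usage.input_tokens_details.cached_tokens)

call_once()  # first call: expect cached_tokens == 0
call_once()  # second identical call: cached_tokens > 0 if caching works
```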

2 Likes

You have the exact same cache issues.

The “user” field is just for safety now, unless this varies per model instead of being a universal change as documented. That guidance should not be needed.

Cache should be delivered regardless of “prompt_cache_key” in organizations with low or no other usage at the time of the calls, as is documented. It is merely an additional input to the hashing of the input, one that can break cache. That guidance should not be needed unless there is concurrent servicing.


For a script that makes a warmup call to each model in a test list individually and sequentially (a slow process when these models are “reasoning”), with an additional 5-second sleep, I received 0 cache hits across all gpt-5 models in the follow-up single trial used for statistics.

Chat Completions, streaming, no “service_tier”; all model reports look like this:

For 1 trials of gpt-5.2-2025-12-11 @ 2026-01-05 01:50PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 1.1238 | 1.1238 | 1.1238 | 1.1238 |
| total response (s) | 3.5793 | 3.5793 | 3.5793 | 3.5793 |
| response tokens | 128 | 128 | 128 | 128 |
| total rate (tok/s) | 35.761 | 35.761 | 35.761 | 35.761 |
| stream rate (tok/s) | 51.7 | 51.7 | 51.7 | 51.7 |

Cache statistics for gpt-5.2-2025-12-11:

Total Trials: 1
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 1
Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 0.0
Avg Cache Coverage: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 1 |

The last part of each report is a histogram; here there is only 1 warmup and 1 trial per model.

So after one run, there are TWO identical calls that were capable of populating the cache. This is me running my testing script again after a minute: another warmup call, making three cache-able calls before the one finally tested. Result: no cache for the same mini and nano models reported in September, and no cache for gpt-5.2. Only gpt-5 and gpt-5.1 delivered 2432/2490 cached tokens, and then only at call #4.

Log

(note, there’s a dot printed for every SSE output event, not displayed by forum software)

Warmup for gpt-5-nano-2025-08-07 done.

Warmup for gpt-5-mini-2025-08-07… done.

Warmup for gpt-5-2025-08-07 done.

Warmup for gpt-5.1-2025-11-13… done.

Warmup for gpt-5.2-2025-12-11… done.

Waiting for cache to be available…
Proceeding to trials.

Trial 1 for gpt-5-nano-2025-08-07 done.

Trial 1 for gpt-5-mini-2025-08-07… done.

Trial 1 for gpt-5-2025-08-07… done.

Trial 1 for gpt-5.1-2025-11-13… done.

Trial 1 for gpt-5.2-2025-12-11… done.

For 1 trials of gpt-5-nano-2025-08-07 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 9.2408 | 9.2408 | 9.2408 | 9.2408 |
| total response (s) | 9.2408 | 9.2408 | 9.2408 | 9.2408 |
| response tokens | 1024 | 1024 | 1024 | 1024 |
| total rate (tok/s) | 110.813 | 110.813 | 110.813 | 110.813 |
| stream rate (tok/s) | 1023.0 | 1023.0 | 1023.0 | 1023.0 |

Cache statistics for gpt-5-nano-2025-08-07:

Total Trials: 1
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 1
Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 0.0
Avg Cache Coverage: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 1 |

For 1 trials of gpt-5-mini-2025-08-07 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 12.3435 | 12.3435 | 12.3435 | 12.3435 |
| total response (s) | 17.1476 | 17.1476 | 17.1476 | 17.1476 |
| response tokens | 1024 | 1024 | 1024 | 1024 |
| total rate (tok/s) | 59.717 | 59.717 | 59.717 | 59.717 |
| stream rate (tok/s) | 212.9 | 212.9 | 212.9 | 212.9 |

Cache statistics for gpt-5-mini-2025-08-07:

Total Trials: 1
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 1
Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 0.0
Avg Cache Coverage: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 1 |

For 1 trials of gpt-5-2025-08-07 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 20.1620 | 20.1620 | 20.1620 | 20.1620 |
| total response (s) | 20.7592 | 20.7592 | 20.7592 | 20.7592 |
| response tokens | 1024 | 1024 | 1024 | 1024 |
| total rate (tok/s) | 49.328 | 49.328 | 49.328 | 49.328 |
| stream rate (tok/s) | 1713.0 | 1713.0 | 1713.0 | 1713.0 |

Cache statistics for gpt-5-2025-08-07:

Total Trials: 1
Cache Hits (cached_tokens > 0): 1
Cache Misses (cached_tokens == 0): 0
Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 2432.0
Avg Cache Coverage: 97.67%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 1 |

For 1 trials of gpt-5.1-2025-11-13 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 0.5681 | 0.5681 | 0.5681 | 0.5681 |
| total response (s) | 7.7394 | 7.7394 | 7.7394 | 7.7394 |
| response tokens | 410 | 410 | 410 | 410 |
| total rate (tok/s) | 52.976 | 52.976 | 52.976 | 52.976 |
| stream rate (tok/s) | 57.0 | 57.0 | 57.0 | 57.0 |

Cache statistics for gpt-5.1-2025-11-13:

Total Trials: 1
Cache Hits (cached_tokens > 0): 1
Cache Misses (cached_tokens == 0): 0
Cache Hit Rate: 100.00%
Cache Miss Rate: 0.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 2432.0
Avg Cache Coverage: 97.67%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 2432 | 1 |

For 1 trials of gpt-5.2-2025-12-11 @ 2026-01-05 01:54PM:

| Stat | Average | First | Minimum | Maximum |
| --- | ---: | ---: | ---: | ---: |
| latency (s) | 0.4638 | 0.4638 | 0.4638 | 0.4638 |
| total response (s) | 7.2426 | 7.2426 | 7.2426 | 7.2426 |
| response tokens | 308 | 308 | 308 | 308 |
| total rate (tok/s) | 42.526 | 42.526 | 42.526 | 42.526 |
| stream rate (tok/s) | 45.3 | 45.3 | 45.3 | 45.3 |

Cache statistics for gpt-5.2-2025-12-11:

Total Trials: 1
Cache Hits (cached_tokens > 0): 0
Cache Misses (cached_tokens == 0): 1
Cache Hit Rate: 0.00%
Cache Miss Rate: 100.00%
Avg Prompt Tokens: 2490.0
Avg Cached Tokens: 0.0
Avg Cache Coverage: 0.00%

Cached Tokens Counts:

| Cached Tokens Value | Count |
| ---: | ---: |
| 0 | 1 |

Replication code

…enjoy more free work, ironically, while the overbilling of API clients continues.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Iterable
import time

import openai  # requires: pip install openai


# -----------------------------
# Test parameters
# -----------------------------
MODELS = [
    "gpt-5-nano-2025-08-07",
    "gpt-5-mini-2025-08-07",
    "gpt-5-2025-08-07",
    "gpt-5.1-2025-11-13",
    "gpt-5.2-2025-12-11",
]

TRIALS = 1
MAX_COMPLETION_TOKENS = 1024  # keep small; we care about prompt caching
SLEEP_AFTER_WARMUP_S = 5

DISPLAY_CACHE = True
DISPLAY_BENCHMARK = True


PROMPT = "Write another poem, about a puppy, notably different."

HISTORY_TEXT = """
Certainly! Here are ten different poems about kittens, each with six stanzas:

---

**1. The Playful Kitten**

In the morning light, a kitten plays,  
Chasing shadows in a sunlit haze.  
With tiny paws and a curious gaze,  
It leaps and bounds in joyful ways.

A ball of yarn becomes its prey,  
Rolling, tumbling, throughout the day.  
With every pounce, it seems to say,  
"Come join me in my playful fray."

The world is vast, a grand expanse,  
For a kitten's heart, a place to dance.  
In every corner, a new chance,  
To explore, to leap, to prance.

With whiskers twitching, ears alert,  
It scampers 'cross the grassy dirt.  
A gentle breeze lifts its furred shirt,  
As it chases leaves with a joyful spurt.

When the sun dips low, the kitten rests,  
Curled in a ball, its cozy nest.  
Dreams of adventures fill its chest,  
In slumber, it finds its peaceful best.

The night is calm, the stars aglow,  
The kitten sleeps, its breath is slow.  
In dreams, it dances to and fro,  
A playful spirit, forever aglow.

---

**2. The Curious Explorer**

A kitten peeks through the garden gate,  
With eyes so wide, it cannot wait.  
To see the world, to contemplate,  
The wonders that lie beyond its fate.

A butterfly flutters by its nose,  
The kitten leaps, its excitement grows.  
In the garden, where the wild rose,  
Blooms in colors that the kitten knows.

The grass is soft beneath its feet,  
A world of scents, a world to greet.  
With every step, a new heartbeat,  
In the garden, where adventures meet.

A rustle in the leaves, a sound,  
The kitten pauses, looks around.  
In the silence, it is spellbound,  
By the mysteries that abound.

The sun sets low, the shadows creep,  
The kitten yawns, its eyes half-sleep.  
In the garden, where secrets keep,  
It finds a place to rest and weep.

The stars come out, the moon is bright,  
The kitten dreams in the soft moonlight.  
In its heart, a gentle delight,  
For the world it explored with all its might.

---

**3. The Gentle Companion**

A kitten sits by the window pane,  
Watching the world through the falling rain.  
With eyes so soft, it feels no pain,  
In its heart, a gentle refrain.

A friend to all, with a tender purr,  
It comforts those who come to confer.  
In its presence, hearts never stir,  
For the kitten's love is a gentle blur.

With a nuzzle and a soft embrace,  
It brings a smile to every face.  
In its warmth, there's a special grace,  
A gentle love that leaves no trace.

The world outside may be cold and gray,  
But the kitten's heart is warm and gay.  
In its presence, worries fade away,  
For its love is here to stay.

As the night falls, the kitten sleeps,  
In dreams, its gentle spirit leaps.  
In its heart, a love that keeps,  
A gentle soul that never weeps.

The dawn breaks, the world awakes,  
The kitten stirs, its heart still aches.  
For the love it gives, the love it takes,  
A gentle companion, for all our sakes.

---

**4. The Mischievous Kitten**

A kitten with a twinkle in its eye,  
Plots its next adventure, oh so sly.  
With a flick of its tail, it leaps high,  
Into the world where mischief lies.

A vase on the table, a tempting sight,  
The kitten ponders, its eyes alight.  
With a gentle nudge, it takes flight,  
And crashes down in the quiet night.

A curtain sways, a perfect climb,  
The kitten scales it in no time.  
With a playful meow, a joyful chime,  
It swings and sways, a rhythmic rhyme.

A shoe becomes a hiding place,  
The kitten peeks out with a cheeky face.  
In its eyes, a mischievous trace,  
As it darts away in a playful race.

The day is done, the kitten rests,  
In dreams, it plans its next conquests.  
With a purr, it snuggles in its nest,  
A mischievous heart, forever blessed.

The night is calm, the stars are bright,  
The kitten dreams of its playful plight.  
In its heart, a mischievous light,  
A playful spirit, taking flight.

---

**5. The Adventurous Spirit**

A kitten with a heart so bold,  
Ventures forth into the world untold.  
With every step, a story unfolds,  
In the heart of a kitten, brave and bold.

The forest calls, a whispering breeze,  
The kitten answers with gentle ease.  
In the shadows, where the wild trees,  
Stand tall and proud, the kitten sees.

A river flows, a gentle stream,  
The kitten leaps, its eyes agleam.  
In the water, where the fish gleam,  
It finds a world of endless dreams.

The mountains rise, a daunting sight,  
The kitten climbs with all its might.  
In the heights, where the eagles take flight,  
It finds a world of pure delight.

The sun sets low, the world is still,  
The kitten rests on a quiet hill.  
In its heart, a gentle thrill,  
For the adventures that it will fulfill.

The stars come out, the night is clear,  
The kitten dreams without a fear.  
In its heart, a world so dear,  
An adventurous spirit, forever near.

---

**6. The Dreamer**

A kitten dreams beneath the stars,  
Of distant lands and worlds afar.  
In its heart, a gentle spar,  
A dreamer with a wishful star.

The moonlight dances on its fur,  
A gentle breeze begins to stir.  
In its dreams, the world is a blur,  
A place where dreams and wishes occur.

A castle stands on a distant hill,  
The kitten dreams of a world so still.  
In its heart, a gentle thrill,  
For the dreams that it will fulfill.

A ship sails on a moonlit sea,  
The kitten dreams of being free.  
In its heart, a gentle plea,  
For the dreams that it will see.

The stars twinkle in the night,  
The kitten dreams of a world so bright.  
In its heart, a gentle light,  
A dreamer with a wishful sight.

The dawn breaks, the world awakes,  
The kitten stirs, its heart still aches.  
For the dreams it dreams, the dreams it makes,  
A dreamer with a heart that never breaks.

---

**7. The Gentle Healer**

A kitten with a gentle touch,  
Brings comfort to those who need it much.  
With a purr and a gentle clutch,  
It heals the heart with a gentle hush.

A friend to all, with a tender heart,  
It brings warmth to those who fall apart.  
In its presence, a gentle start,  
To heal the wounds that tear apart.

With a nuzzle and a gentle purr,  
It brings peace to those who confer.  
In its warmth, a gentle blur,  
A healing touch that leaves no stir.

The world may be harsh and cold,  
But the kitten's heart is warm and bold.  
In its presence, worries unfold,  
For its love is a story untold.

As the night falls, the kitten sleeps,  
In dreams, its gentle spirit leaps.  
In its heart, a love that keeps,  
A gentle healer, for all our sakes.

The dawn breaks, the world awakes,  
The kitten stirs, its heart still aches.  
For the love it gives, the love it takes,  
A gentle healer, for all our sakes.

---

**8. The Playful Spirit**

A kitten with a playful heart,  
Leaps and bounds, a joyful start.  
In the garden, where the flowers part,  
It finds a world of playful art.

A butterfly flutters by its nose,  
The kitten leaps, its excitement grows.  
In the garden, where the wild rose,  
Blooms in colors that the kitten knows.

The grass is soft beneath its feet,  
A world of scents, a world to greet.  
With every step, a new heartbeat,  
In the garden, where adventures meet.

A rustle in the leaves, a sound,  
The kitten pauses, looks around.  
In the silence, it is spellbound,  
By the mysteries that abound.

The sun sets low, the shadows creep,  
The kitten yawns, its eyes half-sleep.  
In the garden, where secrets keep,  
It finds a place to rest and weep.

The stars come out, the moon is bright,  
The kitten dreams in the soft moonlight.  
In its heart, a gentle delight,  
For the world it explored with all its might.

---

**9. The Gentle Guardian**

A kitten with a watchful eye,  
Guards the home where loved ones lie.  
With a gentle purr and a soft sigh,  
It watches over with a loving eye.

A friend to all, with a tender heart,  
It brings warmth to those who fall apart.  
In its presence, a gentle start,  
To guard the hearts that tear apart.

With a nuzzle and a gentle purr,  
It brings peace to those who confer.  
In its warmth, a gentle blur,  
A guardian's touch that leaves no stir.

The world may be harsh and cold,  
But the kitten's heart is warm and bold.  
In its presence, worries unfold,  
For its love is a story untold.

As the night falls, the kitten sleeps,  
In dreams, its gentle spirit leaps.  
In its heart, a love that keeps,  
A gentle guardian, for all our sakes.

The dawn breaks, the world awakes,  
The kitten stirs, its heart still aches.  
For the love it gives, the love it takes,  
A gentle guardian, for all our sakes.

---

**10. The Joyful Kitten**

A kitten with a joyful heart,  
Leaps and bounds, a joyful start.  
In the garden, where the flowers part,  
It finds a world of joyful art.

A butterfly flutters by its nose,  
The kitten leaps, its excitement grows.  
In the garden, where the wild rose,  
Blooms in colors that the kitten knows.

The grass is soft beneath its feet,  
A world of scents, a world to greet.  
With every step, a new heartbeat,  
In the garden, where adventures meet.

A rustle in the leaves, a sound,  
The kitten pauses, looks around.  
In the silence, it is spellbound,  
By the mysteries that abound.

The sun sets low, the shadows creep,  
The kitten yawns, its eyes half-sleep.  
In the garden, where secrets keep,  
It finds a place to rest and weep.

The stars come out, the moon is bright,  
The kitten dreams in the soft moonlight.  
In its heart, a gentle delight,  
For the world it explored with all its might.

---

Feel free to use these poems!
""".strip()

HISTORY: list[dict[str, Any]] = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": (
                    "You are a creative author. "
                    "You are powered by an AI model with a 128k context window for responses."
                ),
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Write ten different poems about kittens, each with six stanzas.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": HISTORY_TEXT,
            }
        ],
    },
]


# -----------------------------
# Benchmark plumbing
# -----------------------------
@dataclass(frozen=True, slots=True)
class CallMetrics:
    latency_s: float
    total_s: float
    prompt_tokens: int
    completion_tokens: int
    cached_prompt_tokens: int

    @property
    def total_rate(self) -> float:
        if self.total_s <= 0:
            return 0.0
        return self.completion_tokens / self.total_s

    @property
    def stream_rate(self) -> float:
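        # Exclude the first token (its wait is counted as latency) and clamp
        # both terms at zero, so a response with no visible tokens cannot
        # produce a negative rate the way the September figures did.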
        streamed_tokens = max(self.completion_tokens - 1, 0)
        stream_time = max(self.total_s - self.latency_s, 0.0)
        return streamed_tokens / (stream_time if stream_time > 0 else 1.0)


def now_str() -> str:
    return time.strftime("%Y-%m-%d %I:%M%p", time.localtime(time.time()))


def safe_int(value: Any) -> int:
    try:
        return int(value)
    except Exception:
        return 0


def extract_cached_tokens(usage: Any) -> int:
    details = getattr(usage, "prompt_tokens_details", None)
    if details is None:
        return 0
    return safe_int(getattr(details, "cached_tokens", 0))


def run_streamed_call(
    *,
    client: openai.Client,
    model: str,
    messages: list[dict[str, Any]],
    max_completion_tokens: int,
    show_progress: bool,
) -> CallMetrics:
    start = time.perf_counter()
    first_content_at: float | None = None
    usage_obj: Any | None = None

    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        max_completion_tokens=max_completion_tokens,
        stream_options={"include_usage": True},
    )

    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage_obj = chunk.usage

        choices = getattr(chunk, "choices", None)
        if not choices:
            continue

        delta = getattr(choices[0], "delta", None)
        if delta is None:
            continue

        content = getattr(delta, "content", None)
        if not content:
            continue

        if first_content_at is None:
            first_content_at = time.perf_counter()

        if show_progress:
            print(".", end="", flush=True)

    end = time.perf_counter()

    latency_s = (first_content_at - start) if first_content_at is not None else (end - start)
    total_s = end - start

    prompt_tokens = safe_int(getattr(usage_obj, "prompt_tokens", 0)) if usage_obj else 0
    completion_tokens = safe_int(getattr(usage_obj, "completion_tokens", 0)) if usage_obj else 0
    cached_prompt_tokens = extract_cached_tokens(usage_obj) if usage_obj else 0

    return CallMetrics(
        latency_s=round(latency_s, 4),
        total_s=round(total_s, 4),
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        cached_prompt_tokens=cached_prompt_tokens,
    )


def run_models_phase(
    *,
    client: openai.Client,
    models: Iterable[str],
    messages: list[dict[str, Any]],
    max_completion_tokens: int,
    label: str,
    show_progress: bool,
) -> dict[str, CallMetrics | None]:
    results: dict[str, CallMetrics | None] = {}
    for model in models:
        print(f"\n{label} for {model}", end="", flush=True)
        try:
            metrics = run_streamed_call(
                client=client,
                model=model,
                messages=messages,
                max_completion_tokens=max_completion_tokens,
                show_progress=show_progress,
            )
            print(" done.")
            results[model] = metrics
        except Exception as e:
            print(f" error: {e}")
            results[model] = None
    return results


def stats_row(values: list[float]) -> tuple[float, float, float, float] | None:
    if not values:
        return None
    first = values[0]
    return (sum(values) / len(values), first, min(values), max(values))


def print_reports(results: dict[str, list[CallMetrics]]) -> None:
    stamp = now_str()

    for model, calls in results.items():
        if DISPLAY_BENCHMARK:
            print(f"### For {len(calls)} trials of {model} @ {stamp}:")
            print("| Stat | Average | First | Minimum | Maximum |")
            print("| --- | ---: | ---: | ---: | ---: |")

            latency = [c.latency_s for c in calls]
            total = [c.total_s for c in calls]
            completion_tokens = [float(c.completion_tokens) for c in calls]
            total_rate = [c.total_rate for c in calls]
            stream_rate = [c.stream_rate for c in calls]

            for name, values, fmt in [
                ("latency (s)", latency, "{:.4f}"),
                ("total response (s)", total, "{:.4f}"),
                ("response tokens", completion_tokens, "{:.0f}"),
                ("total rate (tok/s)", total_rate, "{:.3f}"),
                ("stream rate (tok/s)", stream_rate, "{:.1f}"),
            ]:
                row = stats_row(values)
                if row is None:
                    print(f"| {name} | N/A | N/A | N/A | N/A |")
                    continue
                avg, first, mn, mx = row
                print(
                    f"| {name} | {fmt.format(avg)} | {fmt.format(first)} | {fmt.format(mn)} | {fmt.format(mx)} |"
                )
            print()

        if DISPLAY_CACHE:
            cached = [c.cached_prompt_tokens for c in calls]
            prompt = [c.prompt_tokens for c in calls]

            cache_hits = sum(1 for t in cached if t > 0)
            cache_misses = sum(1 for t in cached if t == 0)
            total_trials = len(cached)

            avg_prompt = (sum(prompt) / len(prompt)) if prompt else 0.0
            avg_cached = (sum(cached) / len(cached)) if cached else 0.0
            avg_cov = (100.0 * avg_cached / avg_prompt) if avg_prompt > 0 else 0.0

            print(f"### Cache statistics for {model}:")
            print(f"Total Trials: {total_trials}")
            print(f"Cache Hits (cached_tokens > 0): {cache_hits}")
            print(f"Cache Misses (cached_tokens == 0): {cache_misses}")
            if total_trials > 0:
                print(f"Cache Hit Rate: {(cache_hits / total_trials) * 100:.2f}%")
                print(f"Cache Miss Rate: {(cache_misses / total_trials) * 100:.2f}%")
            print(f"Avg Prompt Tokens: {avg_prompt:.1f}")
            print(f"Avg Cached Tokens: {avg_cached:.1f}")
            print(f"Avg Cache Coverage: {avg_cov:.2f}%")
            print()

            token_counts: dict[int, int] = {}
            for t in cached:
                token_counts[t] = token_counts.get(t, 0) + 1

            print("#### Cached Tokens Counts:")
            print("| Cached Tokens Value | Count |")
            print("| ---: | ---: |")
            for tokens_value in sorted(token_counts.keys()):
                print(f"| {tokens_value} | {token_counts[tokens_value]} |")
            print()


def main() -> None:
    client = openai.Client(timeout=120, max_retries=0)  # Uses OPENAI_API_KEY env var

    messages = HISTORY + [{"role": "user", "content": PROMPT}]

    run_models_phase(
        client=client,
        models=MODELS,
        messages=messages,
        max_completion_tokens=MAX_COMPLETION_TOKENS,
        label="Warmup",
        show_progress=True,
    )

    if SLEEP_AFTER_WARMUP_S > 0:
        print("\nWaiting for cache to be available...")
        time.sleep(SLEEP_AFTER_WARMUP_S)
        print("Proceeding to trials.\n")

    results: dict[str, list[CallMetrics]] = {m: [] for m in MODELS}
    for i in range(TRIALS):
        phase = run_models_phase(
            client=client,
            models=MODELS,
            messages=messages,
            max_completion_tokens=MAX_COMPLETION_TOKENS,
            label=f"Trial {i + 1}",
            show_progress=True,
        )
        for model, metrics in phase.items():
            if metrics is not None:
                results[model].append(metrics)

    print()
    print_reports(results)


if __name__ == "__main__":
    main()
```
2 Likes

I’ll add my input by linking two more recent independent reports on the same issue. Hope this helps pinpoint the root cause!

4 Likes

Hi @vb, my caching issue was a prompt-engineering bug, so no issues on my side. I will report in the thread.

1 Like