I experience the same behavior:
I had to increase my initial prompt to about 1,200 tokens before caching kicked in; the next round-trip to gpt-4o-2024-08-06 then returned a chat completion with "prompt_tokens_details": {"cached_tokens": 1024}. It looks like gpt-4o and gpt-4o-2024-08-06 tokenize differently, although according to the docs both should use the same o200k_base encoding.
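For anyone debugging the same thing, here is a minimal sketch of how I check for cache hits. It just inspects the usage block of a parsed response dict, following the "prompt_tokens_details" shape quoted above; the sample payload and the helper name are my own, not from the API docs.

```python
# Sketch: read the cached-token count out of a chat-completion response
# (as a plain dict, e.g. after response.model_dump() or json parsing).
# Field names follow the "prompt_tokens_details" shape shown above.

def cached_tokens(response: dict) -> int:
    """Return how many prompt tokens were served from cache (0 = cache miss)."""
    usage = response.get("usage", {})
    details = usage.get("prompt_tokens_details", {})
    return details.get("cached_tokens", 0)

# Hypothetical payload matching what I saw in my logs:
sample = {
    "usage": {
        "prompt_tokens": 1226,
        "completion_tokens": 42,
        "total_tokens": 1268,
        "prompt_tokens_details": {"cached_tokens": 1024},
    }
}

print(cached_tokens(sample))  # 1024 -> the first 1024-token prefix was cached
```

Note that cached_tokens comes back in multiples of 128, which is why a ~1,200-token prompt reports exactly 1024 cached tokens.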
Cached prefixes generally remain active for 5 to 10 minutes of inactivity. However, during off-peak periods, caches may persist for up to one hour.
In most cases the prompt is cached as expected, and my logs show that even new sessions started within 10 minutes reuse the cache. But sometimes the initial prompt of a session is not cached, and a follow-up round-trip within 30 seconds still doesn't pick up the cache. I don't understand why.