In_memory vs 24h caching; help please

Hi all, I’m new to the API and would appreciate some clarification. I did check the docs, but apparently I still don’t quite understand.
I’m running a GPT as an agent through the Letta ADE, but using my own OpenAI API key, so I’m billed by OpenAI directly. At first I was using in_memory caching, but because I’m the only user of this agent and I generally talk to it throughout the day or during long problem-solving sessions (i.e. there are sometimes pauses of half an hour or more between my prompts), I assumed switching to 24h caching would be better for cost management.
However, since I switched from in_memory caching to 24h, my caching rate has dropped dramatically and my input tokens have skyrocketed.
Have I misunderstood how the two caching methods work and which one is better for my use case? Any help on this would be appreciated, thank you.


Conversation history, or a memory feature of a software product that uses the API, is not what the API means by a “cache”.

Context window caching is an API backend feature:

If your requests share the exact same starting sequence of messages, longer than 1024 tokens, you can get a discount on the reused precomputation of prefill, when that prefix is the same and verbatim. This was server-based and short-term when first provided as an automatic feature, but with gpt-5.1 (only?) they give you an explicit option to specify a longer retention, perhaps backed by a database lookup of a KV cache.

If you are using some external “keep a memory” system, and it is changing the message context or altering earlier message contents on you, that will damage the hit rate. The term “in_memory” is not part of the API, and is likely some “memory block” of the product you describe that is not actually cache-aware, or that is damaging the faithfulness of repeated inputs.

Umm, no. I’m literally reading the OpenAI page about prompt caching (Prompt caching | OpenAI API). It specifies which models extended caching is allowed for, and that the two values for prompt_cache_retention are in_memory and 24h. Those two modes are what I’m talking about.
Letta sends the system instructions + tooling first, then the chat history. The only part of the sent context that should normally be changing is the last message I type, not the initial (around 6k-token) system instructions. Which is why I don’t understand how short-term caching worked but 24h caching seems to fail with the exact same setup.
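One way to sanity-check that claim is to log the message lists of consecutive requests and count how many leading messages are verbatim identical, since the cache only hits on an exact prefix. A minimal sketch (the helper name is mine, not part of any API):

```python
# Sketch: count how many leading messages two requests share verbatim.
# Any difference in the system instructions or early history shrinks
# the cacheable prefix and drops the cached-token count.

def shared_prefix_len(messages_a: list[dict], messages_b: list[dict]) -> int:
    """Number of leading messages that are byte-for-byte identical."""
    n = 0
    for a, b in zip(messages_a, messages_b):
        if a != b:
            break
        n += 1
    return n

first = [
    {"role": "system", "content": "6k tokens of instructions + tooling"},
    {"role": "user", "content": "first question"},
]
second = [
    {"role": "system", "content": "6k tokens of instructions + tooling"},
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "answer"},
    {"role": "user", "content": "follow-up"},
]
print(shared_prefix_len(first, second))  # 2: the prefix is stable
```

If this count ever drops between back-to-back requests, something upstream is rewriting the prefix on you.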


Thanks for sharing the document; I wasn’t aware there was also an enum value for the default, since I don’t waste bandwidth retrieving avoidable parameter echoes when using Responses.

There’s some rather wishy-washy language in the “prompt caching” document that doesn’t make much sense, despite their attempt to elevate the technical description: “Only the key/value tensors may be persisted in local storage; the original customer content, such as prompt text, is only retained in memory.”

- Neither mode should care directly about the integer sequence used to make the precache step intervals;
- both must be context-aware in order to know a compatible resume point to continue from, such as by hash-table lookup.

One thing not stated as a pattern, but which should be apparent: pass “24h” in all requests; prompt_cache_retention is likely not just a “please store”, but also a “please consume from”.

The only thing peculiar is your report: “my input tokens have skyrocketed”. Perhaps you mean that your costs and billings have skyrocketed?

I would record and characterize the “usage” response objects you get, for “input/cached”, over a period of use, along with the duration between requests and the input tokens that you measure to be common and unaltered between the first request and the follow-up. Then try the opposite setting for a while.
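A sketch of the kind of logging I mean; the field names (`input_tokens`, `input_tokens_details.cached_tokens`) are what the Responses API reports as far as I know, but verify them against your own response objects:

```python
# Sketch: accumulate per-request cache statistics from the API's usage
# object (here taken as a plain dict; field names assumed from the
# Responses API).
import time

log: list[dict] = []

def record_usage(usage: dict) -> dict:
    """Append one request's cache stats and return them."""
    cached = usage.get("input_tokens_details", {}).get("cached_tokens", 0)
    total = usage["input_tokens"]
    entry = {
        "t": time.time(),  # so gaps between requests can be measured
        "input_tokens": total,
        "cached_tokens": cached,
        "hit_rate": cached / total if total else 0.0,
    }
    log.append(entry)
    return entry

# e.g. a follow-up request where a ~6k-token prefix was served from cache
e = record_usage({"input_tokens": 7000,
                  "input_tokens_details": {"cached_tokens": 6016}})
print(f"{e['hit_rate']:.0%}")  # 86%
```

A run of entries like this, under each retention setting in turn, is exactly the evidence the comparison needs.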

Also note the crossing of UTC dates between requests: OpenAI encodes today’s date and changes your context on you. They’ll break the cache in other ways too: add an image, and 200 tokens of guardrail text are added to their system message; add a file search, and a message before the latest input that keeps rotating forward breaks caching of that newest message and of the tool retrieval persisted in chat history.
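To make the UTC-date point concrete: if injected system text embeds the date, the prefix itself changes at midnight UTC and nothing after that point can match the cache. The exact injected wording below is an assumption; the mechanism is what matters:

```python
# Sketch: a date embedded near the top of the context means the cached
# prefix silently changes once per UTC day.
from datetime import date

def injected_prefix(today: date) -> str:
    # Hypothetical stand-in for server-side text that embeds the date.
    return f"Current date: {today.isoformat()}\nYou are a helpful agent."

before = injected_prefix(date(2025, 1, 1))
after = injected_prefix(date(2025, 1, 2))
# One changed token near the top invalidates the whole cached prefix
# that follows it.
print(before == after)  # False
```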

Significant evidence that one method is not working right and is not meeting expectations for an API organization, presented so that OpenAI believes it, is where action and repair can actually take place.
