Is there a way to disable prompt caching in the APIs

I am seeing inconsistent results with structured outputs. For the first couple of turns, the cached-token count is 0 and the output is as expected. Once the prompt gets cached, the data no longer adheres to the format and is missing fields. Is there a way to disable the prompt cache?

This sounds like an edge case. Do you mind explaining more? OpenAI staff sometimes swing by and it may be something interesting.

As far as I know there’s no way to toggle prompt caching. However, I believe the structured output schemas are kind of cached as well, soooo… it gets complicated :sweat_smile:


Hey, I have the same question. OpenAI doesn’t always respond according to the specified structured JSON schema (sometimes its reply is the actual schema instead of the response).

In these cases, I’d like to retry the request to get a proper response. But the newly introduced cache seems to keep spitting out the same invalid response over and over, leaving my retry loop hitting the API indefinitely.

Would love to be able to override the cache in certain scenarios.

Changing ANYTHING in the first 1024 tokens should defeat the cache.

Just add a random nonce to the system prompt. Or something useful, like the current time and date of the latest request.
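If it helps, here’s a minimal sketch of that idea, assuming the OpenAI Python SDK (the model name, prompt text, and `build_system_prompt` helper are just placeholders):

```python
import uuid
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()

def build_system_prompt(base_prompt: str) -> str:
    # Prepend something unique so the first 1024 tokens never match a
    # previously cached prefix. A timestamp is also genuinely useful context.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    nonce = uuid.uuid4().hex[:8]  # purely a cache-buster
    return f"Current UTC time: {stamp} (request {nonce})\n\n{base_prompt}"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": build_system_prompt("Respond only with JSON matching the agreed schema.")},
        {"role": "user", "content": "Summarize today's tickets."},
    ],
)
# usage.prompt_tokens_details.cached_tokens should now stay at 0
print(response.usage)
```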


Thanks, good idea. Would it also work to apply something at the end? Like a new “user” message telling it to be careful with output formatting.


The cache matches from the beginning of the input context, and is only used when everything from the start of the input is identical.

Imagine if you send a chat with system message, tools, and chat history totaling 2000 tokens. The first chat turn of user input is at the 1200 token point.

Those 2000 tokens form a stored cache: after the 1024-token point, further “chunks” of cache state are captured every 128 tokens as the input computation is built. 2000 / 128 = 15 complete cache chunks, or 1920 tokens.

If you quickly send another message appended to that same chat history, all 1920 tokens should be reused.

If you were to make alterations mid-state, such as deleting the oldest message to manage your budget, you would have a different token stream after the 1200-token point, and only 9 chunks = 1152 tokens could match.

So continuing a growing conversation in near-realtime (cache timeout 5-60 minutes), where you only add to the end of the chat session without further management, is the ideal use of the cache, not a way to disable it. You have to “break” the cached pattern earlier in the input.
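To make that arithmetic concrete, here is a tiny sketch of the chunk calculation exactly as described above (the 1024-token minimum and 128-token increments come from this post; the function name is mine):

```python
def cacheable_tokens(matched_prefix_tokens: int, min_prefix: int = 1024, chunk: int = 128) -> int:
    # No cache hit below 1024 matching prefix tokens; above that,
    # only whole 128-token chunks of the matching prefix are reused.
    if matched_prefix_tokens < min_prefix:
        return 0
    return (matched_prefix_tokens // chunk) * chunk

print(cacheable_tokens(2000))  # 1920 -> history re-sent unchanged
print(cacheable_tokens(1200))  # 1152 -> stream differs after the 1200-token point
```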


The ultimate problem with the original post may simply be a long conversation that distracts from the clear mission and schema set forth early in the chat’s system message. You might have told the AI exactly how to respond in the system message, but with five chat exchanges also being sent, it is back to plain old annoying ChatGPT-like behavior.


Another thought: this may be why OpenAI will now simply terminate a ChatGPT session if it grows too long. A growing conversation has a precomputed state that is almost free to re-run when it is appended to, especially if this is a database-stored state rather than the API’s short cache timeout. Truncating the start of the chat, though, is a full reset and the full expense again.


On your comment about falling back to plain old annoying ChatGPT-like behavior: that may be the case. Below, I have added background on why I thought the cache could be the suspect. So, for instances where ChatGPT loses the rule set in the system prompt, what strategies have worked to keep it true to the system prompt’s expectations? One suggestion I have read is to refresh the system prompt every 5-10 exchanges. That seems heavy-handed… but doable. Any other suggestions?

Thanks for that clarification on the cache. In my case, I am using the classic approach of appending new messages at the end, leaving everything, including the system prompt and previous conversation, cacheable. The symptom I saw was that responses were deviating from the rules I set up in the system prompt. That started happening out of nowhere, and the timing was suspicious given caching is on by default, so I wanted to test disabling the cache and eliminate it as a variable. (I like your ideas on how to effectively invalidate the cache.)

This did work for me. Effectively, after every user message I add an instruction to ensure the output is formatted as JSON.
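For anyone who wants to copy that, here’s roughly what it looks like (a sketch assuming the Chat Completions API with JSON mode; the reminder text and model name are just examples):

```python
from openai import OpenAI

client = OpenAI()

FORMAT_REMINDER = {
    "role": "user",
    "content": "Reminder: respond ONLY with a JSON object matching the agreed schema. No prose.",
}

def ask(history: list[dict], user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    history.append(FORMAT_REMINDER)  # appended every turn, so earlier turns stay cacheable
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=history,
        response_format={"type": "json_object"},  # or your json_schema for Structured Outputs
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```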


Does changing the temperature affect the cache?

Something I started doing a while back is adding a small random fraction to the temperature in the hopes of preventing an echo chamber. Wondering if that might help in this type of situation.
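For reference, the jitter I mean is just something like this (the range is arbitrary; temperature is a sampling parameter, not part of the prompt text itself):

```python
import random

base_temperature = 0.7
# Nudge the temperature slightly on each retry so repeated calls
# don't sample identically.
temperature = round(base_temperature + random.uniform(0.0, 0.1), 3)
```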