Is there a way to disable prompt caching in the APIs

I am seeing inconsistent results with structured outputs. For the first couple of turns, the cached-token count is 0 and the output is as expected. Once the prompt gets cached, the data no longer adheres to the format and is missing fields. Is there a way to disable the prompt cache?

This sounds like an edge case. Do you mind explaining more? OpenAI staff sometimes swing by and it may be something interesting.

As far as I know there’s no way to toggle prompt caching. However, I believe the structured output schemas are kind of cached as well, soooo… it gets complicated :sweat_smile:


Hey, I have the same question. OpenAI doesn’t always respond according to the specified structured JSON schema (sometimes its reply is the actual schema instead of the response).

In these cases, I’d like to retry the request to get a proper response. But the newly introduced cache seems to keep spitting out the same invalid response over and over, leaving my retry loop hitting the API indefinitely.

Would love to be able to override the cache in certain scenarios.

Changing ANYTHING in the first 1024 tokens should defeat the cache.

Just add a random nonce to the system prompt. Or something useful, like the current time and date of the latest request.
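If it helps, here’s a minimal sketch of that idea, assuming the OpenAI Python SDK (the model name, prompt text, and `build_system_prompt` helper are just placeholders):

```python
import uuid
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()

def build_system_prompt(base_prompt: str) -> str:
    # Prepend something unique so the first 1024 tokens never match a
    # previously cached prefix. A timestamp is also genuinely useful context.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    nonce = uuid.uuid4().hex[:8]  # purely a cache-buster
    return f"Current UTC time: {stamp} (request {nonce})\n\n{base_prompt}"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": build_system_prompt("Respond only with JSON matching the agreed schema.")},
        {"role": "user", "content": "Summarize today's tickets."},
    ],
)
# usage.prompt_tokens_details.cached_tokens should now stay at 0
print(response.usage)
```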


Thanks, good idea. Would it also work to apply something at the end? Like a new “user” message telling it to be careful with output formatting.


The cache matches from the beginning of the input context, and is only used when everything from the start of the input is identical.

Imagine if you send a chat with system message, tools, and chat history totaling 2000 tokens. The first chat turn of user input is at the 1200 token point.

Those 2000 tokens form a stored cache: after the 1024-token point, further “chunks” of cache state are captured every 128 tokens as the input computation is built. 2000 / 128 = 15 complete cache chunks, or 1920 tokens.

If you quickly send another message appended to that same chat history, all 1920 tokens should be reused.

If you were to make alterations mid-state, such as deleting the oldest message to manage your budget, you would have a different token stream after the 1200-token point, and only 9 chunks = 1152 tokens could match.

So continuing a growing conversation in near-realtime (cache timeout 5-60 minutes), where you only add to the end of the chat session without further management, is the ideal use of the cache, not a way to disable it. You have to “break” the cached pattern earlier in the input.
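To make that arithmetic concrete, here is a tiny sketch of the chunk calculation exactly as described above (the 1024-token minimum and 128-token increments come from this post; the function name is mine):

```python
def cacheable_tokens(matched_prefix_tokens: int, min_prefix: int = 1024, chunk: int = 128) -> int:
    # No cache hit below 1024 matching prefix tokens; above that,
    # only whole 128-token chunks of the matching prefix are reused.
    if matched_prefix_tokens < min_prefix:
        return 0
    return (matched_prefix_tokens // chunk) * chunk

print(cacheable_tokens(2000))  # 1920 -> history re-sent unchanged
print(cacheable_tokens(1200))  # 1152 -> stream differs after the 1200-token point
```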


The ultimate problem with the original post may simply be a long conversation that distracts from the clear mission and schema set forth early in the chat’s system message. You might have told the AI exactly how to respond in the system message, but with five chat exchanges also being sent, it is back to plain old annoying ChatGPT-like behavior.


Another thought: this may be why OpenAI will now simply terminate a ChatGPT session if it grows too long. A growing conversation has a precomputed state that is almost free to re-run when it is appended to, especially if this is a database-stored state rather than the API’s short cache timeout. Truncating the start of the chat, though, is a full reset and the full expense again.


On your comment about falling back to plain old annoying ChatGPT-like behavior: that may be the case. Below, I have added background on why I thought the cache could be the suspect. So, for instances where ChatGPT loses the rule set in the system prompt, what strategies have worked to keep it true to the system prompt’s expectations? One suggestion I have read is to refresh the system prompt every 5-10 exchanges. That seems heavy-handed… but doable. Any other suggestions?

Thanks for that clarification on the cache. In my case, I am using the classic approach of appending new messages at the end, leaving everything, including the system prompt and previous conversation, cacheable. The symptom I saw was that responses were deviating from the rules I set up in the system prompt. That started happening out of nowhere, and the timing was suspicious given caching is on by default, so I wanted to test disabling the cache and eliminate it as a variable. (I like your ideas on how to effectively invalidate the cache.)

This did work for me. Effectively, after every user message I add an instruction to ensure the output is formatted as JSON.
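For anyone who wants to copy that, here’s roughly what it looks like (a sketch assuming the Chat Completions API with JSON mode; the reminder text and model name are just examples):

```python
from openai import OpenAI

client = OpenAI()

FORMAT_REMINDER = {
    "role": "user",
    "content": "Reminder: respond ONLY with a JSON object matching the agreed schema. No prose.",
}

def ask(history: list[dict], user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    history.append(FORMAT_REMINDER)  # appended every turn, so earlier turns stay cacheable
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=history,
        response_format={"type": "json_object"},  # or your json_schema for Structured Outputs
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```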


Does changing the temperature affect the cache?

Something I started doing a while back is adding a small random fraction to the temperature in the hopes of preventing an echo chamber. Wondering if that might help in this type of situation.
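For reference, the jitter I mean is just something like this (the range is arbitrary; temperature is a sampling parameter, not part of the prompt text itself):

```python
import random

base_temperature = 0.7
# Nudge the temperature slightly on each retry so repeated calls
# don't sample identically.
temperature = round(base_temperature + random.uniform(0.0, 0.1), 3)
```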