Prompt caching doesn't seem to work consistently


I am trying to maintain a multi-turn conversation with gpt-4o by incrementally appending each user question and assistant response inside a loop, but the usage logs tell a different story: the number of cached tokens being hit is quite irregular from request to request. Can anybody please help?
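
Here is roughly what my loop looks like (a simplified sketch; it assumes a recent openai Python SDK where `usage.prompt_tokens_details.cached_tokens` is populated, and the real prompts are long enough to pass the 1024-token caching minimum):

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for question in ["First question", "Second question", "Third question"]:
    # New turns are appended at the end, so the earlier prefix should
    # stay byte-identical from request to request.
    messages.append({"role": "user", "content": question})

    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})

    # This is the number that looks irregular in my usage logs.
    details = resp.usage.prompt_tokens_details
    print(f"prompt_tokens={resp.usage.prompt_tokens} cached_tokens={details.cached_tokens}")
```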

Prompt caching basically depends on:

  • The hash of the [first 1024 input tokens] + [user parameter] must match (see the sketch after this list)
  • The cache must be hit again within a 5-minute interval (it can persist up to 1 hour, depending on service demand)
  • It may take a short delay before caching starts taking effect
  • For more details, there is a guide for prompt caching.
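
As a sketch of the first point (the prompt content here is made up), keep everything static at the front and append dynamic, per-turn content only at the end, so the leading tokens stay byte-identical across requests:

```python
STATIC_SYSTEM_PROMPT = "You are a support assistant. <long, unchanging instructions>"

def build_messages(history: list[dict], new_question: str) -> list[dict]:
    # Static prefix first, dynamic content last: only then can the
    # first ~1024 tokens hash to the same value on every request.
    return (
        [{"role": "system", "content": STATIC_SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": new_question}]
    )
```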

Thanks a lot for the guide. I was wondering about this: if it depends on the hash, a request isn't always guaranteed to land on the same machine that served the previous step, right? That would explain why it doesn't always hit, and why every so often a request just happens to land on the correct machine. Just wanted to confirm…

In the guide it says:

  1. Cache Routing:
  • Requests are routed to a machine based on a hash of the initial prefix of the prompt. The hash typically uses the first 256 tokens, though the exact length varies depending on the model.
  • If you provide the user parameter, it is combined with the prefix hash, allowing you to influence routing and improve cache hit rates. This is especially beneficial when many requests share long, common prefixes.
  • If requests for the same prefix and user combination exceed a certain rate (approximately 15 requests per minute), some may overflow and get routed to additional machines, reducing cache effectiveness.
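
In practice, that means sending a stable per-conversation `user` value. A minimal sketch (the session id here is made up):

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello"}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    # A stable per-conversation identifier gets combined with the prefix
    # hash, steering repeat requests toward the same cache-holding machine.
    user="session-7f3a",
)
```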

To add: I've found that some people send dynamic objects (such as a class with a timestamp factory) to the endpoint, which ruins any chance of prompt caching. That's not the case here, since the misses look intermittent; I'm just pointing out that the hash can sometimes change unexpectedly.
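
For example, something like this (a contrived sketch) changes the very first tokens on every call, so the prefix hash never repeats:

```python
from datetime import datetime, timezone

def build_system_prompt() -> str:
    # The timestamp changes the leading tokens on every request,
    # so the prefix hash never matches and the cache never hits.
    now = datetime.now(timezone.utc).isoformat()
    return f"Current time: {now}. You are a helpful assistant."

# Fix: drop the timestamp, or move any dynamic content to the *end*
# of the prompt so the cached prefix stays byte-identical.
```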

It would be nice if OpenAI (or anybody, really) released a simple helper function that returned the hash of the tokens.
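
In the meantime, you can approximate one client-side. To be clear, OpenAI's actual routing hash is internal and undocumented; this sketch (tiktoken plus SHA-256, with the 256-token prefix length taken from the guide, and a naive serialization of the messages) only helps you detect when your own prefix drifts between requests:

```python
import hashlib

import tiktoken

def prefix_hash(messages: list[dict], model: str = "gpt-4o",
                prefix_tokens: int = 256) -> str:
    """Hash the first `prefix_tokens` tokens of a naive serialization.

    NOT OpenAI's internal routing hash -- only useful for spotting
    unintended prefix changes between your own requests.
    """
    enc = tiktoken.encoding_for_model(model)
    text = "".join(f"{m['role']}: {m['content']}\n" for m in messages)
    tokens = enc.encode(text)[:prefix_tokens]
    return hashlib.sha256(repr(tokens).encode()).hexdigest()[:16]
```

If two consecutive requests in a conversation print different values, something in the leading messages changed before the request went out.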