How to improve caching accuracy

Currently, I have an application that sends around 500 API requests to OpenAI every day. There are around 10 types of requests, and each type has a constant system prompt.

In each request, I separate the system prompt from the user request, placing the system prompt first. I'm fairly sure the system prompt is long, more than 1024 tokens, since even when the user request is really short, the input is around 2k tokens.

However, my cache hit rate is almost 0%, if not exactly 0. The cache only seems to work when two requests are exactly the same; if I modify even a tiny bit of the user request, the cache misses. Is there any way to improve this? My goal is just to reduce my API cost. I'm using o4-mini, by the way.

I recommend reading the prompt caching guide; it has some valuable tips.

But basically, the first 1024 tokens of your input, together with the `user` parameter, must be identical, and the second request must arrive within about 5 minutes of the first (during lower-demand hours the cache may persist longer, up to an hour).
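As a minimal sketch of that structure (assuming the Chat Completions API via the official `openai` Python SDK; the prompt string and the `user` value are placeholders), keeping the long system prompt byte-identical and first, with the variable part last, is what makes the prefix cacheable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Keep this string byte-identical across requests: any change within the
# first 1024 tokens (even whitespace) changes the prefix and causes a miss.
SYSTEM_PROMPT = "...your long, constant system prompt..."

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="o4-mini",
        messages=[
            # Static content first, so the cacheable prefix stays identical.
            {"role": "system", "content": SYSTEM_PROMPT},
            # Variable content last, after the cached prefix.
            {"role": "user", "content": question},
        ],
        # A stable identifier per request type helps route repeated
        # requests to the same cache. (Hypothetical value.)
        user="my-app-request-type-1",
    )
    return response.choices[0].message.content
```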

Caching sometimes takes a moment to take effect, so two fast consecutive prompts might both get through before the prefix is cached.

If you have a large system message, it may be worth testing it in isolation and monitoring the cached tokens to catch any unnoticed detail that might be breaking your caching.
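For example (a sketch under the same assumptions as above), the usage object on each response reports how many prompt tokens were served from the cache, so you can log it and see whether your prefix is actually hitting:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "...your long, constant system prompt..."

response = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "short test question"},
    ],
)

usage = response.usage
print(f"prompt tokens: {usage.prompt_tokens}, "
      f"cached: {usage.prompt_tokens_details.cached_tokens}")
# cached_tokens is 0 on a cold request; repeating the same prefix within
# a few minutes should report a large cached count instead.
```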

One detail that can break the cache is changing the `instructions` parameter midway, as it changes the content (and hence the hash) of the first 1024 tokens. But since you are passing your prompt directly as a system role, that should be fine.
