Prompt caching - how many prompts are cached?

Suppose I alternate between two or more prompts, such that every second (or third, etc.) prompt shares the same prefix. Will I get any cache hits? Or do matching requests need to be sequential?
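
To illustrate, here is a rough sketch of the pattern I mean (assuming the Chat Completions API; the model name and filler prompts are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Two long, distinct prefixes, each well over the 1024-token caching minimum.
PREFIX_A = "You are assistant A. " + "alpha " * 2000
PREFIX_B = "You are assistant B. " + "beta " * 2000

# Alternate between the two prefixes on every request.
for i in range(6):
    prefix = PREFIX_A if i % 2 == 0 else PREFIX_B
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": prefix},
            {"role": "user", "content": f"request {i}"},
        ],
        max_tokens=1,
    )
    # cached_tokens reports how much of the prompt was served from cache.
    print(i, response.usage.prompt_tokens_details.cached_tokens)
```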

How many prefixes are cached per org, and how many per user?

Thanks.

Figuring this out would take a series of calls just like you describe, since the behavior is not documented for this case.

Scenario 1:

- setup request 1: 8000 tokens common
- setup request 2: 1500 tokens common (or a bit more, enough to activate caching)

Scenario 2:

- setup request 1: 1500 tokens common
- setup request 2: 8000 tokens, growing the shared prefix

In the second scenario, we can easily expect the 8000 tokens to become the new cache that is available, following the pattern of a growing chat.

However, can a smaller request that hits the cache reset it, so that it then delivers fewer cached tokens than the expected 8000?

Trials of exactly that could answer the question; you can answer it for yourself and for others.
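
For instance, a minimal sketch of such a trial, assuming the Chat Completions API; the model name and token estimates are placeholders, and cached_tokens in the usage details reports how much of each prompt was served from cache:

```python
from openai import OpenAI

client = OpenAI()

def cached_tokens(prompt: str) -> int:
    """Send a request and report how many prompt tokens were cache hits."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any caching-enabled model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
    )
    return response.usage.prompt_tokens_details.cached_tokens

# Rough stand-ins: about two tokens per repeated word.
long_prefix = "cache test " * 4000   # roughly 8000 tokens
short_prefix = long_prefix[:8800]    # roughly 1500 tokens, same leading text

print(cached_tokens(long_prefix))    # warm the cache with the long prefix
print(cached_tokens(short_prefix))   # a smaller request hitting that prefix
print(cached_tokens(long_prefix))    # do the full 8000 tokens still hit?
```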

The user parameter is not for separating caches per user, whether that is an API key holder or an end user. It is just used as part of the hash that routes requests across servers, spreading dissimilar traffic more evenly. Sending a different user ID is equivalent to saying “I don’t expect this to match other cached requests”.
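
As a sketch of the practical consequence (the user value here is arbitrary): keep the same user string on requests that share a prefix, so they are routed alike.

```python
from openai import OpenAI

client = OpenAI()

SHARED_PREFIX = "A long common system prompt..."  # placeholder

# One stable user value for every request sharing this prefix; a
# different value signals "don't expect this to match other caching".
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": SHARED_PREFIX},
        {"role": "user", "content": "Hello"},
    ],
    user="prefix-group-1",  # routing hint, not a per-user cache partition
)
```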

I don’t think there is a limit on the number of cached prefixes, but it depends on what you are doing.

A series of details on how prompt caching works was published recently; it is worth reading.

I’ve tested several concurrent threads with no caching problems.
It works particularly well if you are using previous_response_id for chained prompts.
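
For example, a minimal sketch of that chaining with the Responses API (model and prompts are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# The first turn establishes the conversation and its cacheable prefix.
first = client.responses.create(
    model="gpt-4o-mini",  # placeholder model
    input="Summarize this policy document: ...",
)

# Chaining via previous_response_id reuses the prior turns as the prefix,
# so follow-up requests tend to hit the prompt cache.
follow_up = client.responses.create(
    model="gpt-4o-mini",
    previous_response_id=first.id,
    input="Now list the three most important points.",
)
print(follow_up.output_text)
```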

There is no guarantee, though, of how many attempts it takes for caching to start; it can take a few tries or show a small delay in some cases.

It is worth running some experiments to find which strategy fits you best.