Why does prompt caching require at least 1,024 tokens?

The documentation says the following. I am wondering why this limit is required.

API calls to supported models will automatically benefit from Prompt Caching on prompts longer than 1,024 tokens.

1 Like

OpenAI doesn’t publicly explain why, but if the prompt is shorter than 1,024 tokens, the potential savings (in latency and compute) might be too small to justify the overhead of caching. Most likely, the 1,024-token rule is a practical engineering cutoff: it keeps the caching system efficient, fair, and worth the effort only when it actually saves time and money.
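To illustrate that trade-off, here is a rough back-of-envelope sketch. All of the constants are made-up assumptions for illustration, not OpenAI figures; the point is only that a fixed per-request cache overhead is easily larger than the prefill compute a short prompt would save.

```python
# Hypothetical break-even model for prompt caching (illustrative only).
# Both constants are invented assumptions, not published OpenAI numbers.

CACHE_OVERHEAD_MS = 15.0      # assumed fixed cost: hash the prefix, look it up, restore KV state
PREFILL_MS_PER_TOKEN = 0.02   # assumed prefill compute cost per prompt token

def caching_worthwhile(prompt_tokens: int) -> bool:
    """Return True if reusing a cached prefix would beat simply recomputing the prefill."""
    savings_ms = prompt_tokens * PREFILL_MS_PER_TOKEN
    return savings_ms > CACHE_OVERHEAD_MS

for n in (10, 256, 1024, 4096):
    print(n, caching_worthwhile(n))
# With these made-up constants, short prompts lose: the fixed overhead
# outweighs the small amount of prefill compute that would be skipped.
```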

2 Likes

Thanks Paul, but my understanding is that this cache is the content of the KV cache, which I thought is always created during the prefill phase of inference regardless of the size of the prompt. Maybe I have some gaps in my understanding.

If your input is “A helpful AI/hello”, which is going to be faster: hashing that, looking up a local context-window cache, and loading the hidden state and embeddings of a previous run onto the GPU to resume, or simply running the AI on 10 tokens of input? A conceptual sketch of that distinction follows below.
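Here is a minimal sketch of the distinction, with names and structure that are my own invention rather than OpenAI’s implementation: the per-request KV cache is always built during prefill, but reusing it across requests adds hashing, lookup, and state-restore steps that only pay off when the prefix is long.

```python
import hashlib

# Toy cross-request prefix cache (conceptual sketch, not OpenAI's implementation).
# The per-request KV cache is always built during prefill; this extra layer is
# about persisting that state and restoring it on a *later* request.
prefix_cache: dict[str, bytes] = {}

def prefix_key(prompt_tokens: list[int]) -> str:
    # Hash the token prefix to form a lookup key.
    return hashlib.sha256(str(prompt_tokens).encode("utf-8")).hexdigest()

def prefill(prompt_tokens: list[int]) -> bytes:
    # Stand-in for running the model's prefill pass and producing KV state.
    return b"kv-state-for-%d-tokens" % len(prompt_tokens)

def get_kv_state(prompt_tokens: list[int]) -> bytes:
    key = prefix_key(prompt_tokens)
    if key in prefix_cache:
        return prefix_cache[key]     # cache hit: skip prefill, but pay lookup/restore cost
    state = prefill(prompt_tokens)   # cache miss: compute prefill as usual
    prefix_cache[key] = state        # persist for future requests
    return state
```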

The size of the model, its embeddings, and its generation rate are the variables that go into budgeting latency and optimizing GPU compute. Simply giving you a token threshold provides a predictable discount while hiding proprietary methods. (What is not predictable is when they discount less than they should, by 128, 192, 256 or more tokens.)
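For what it’s worth, the billing behavior can be sketched as a simple rounding rule. This is a small sketch assuming the behavior as documented (no caching below 1,024 tokens, then cache hits in 128-token increments), which is where that “less than they should” shortfall comes from:

```python
def cached_tokens(prompt_tokens: int) -> int:
    """Estimate the cacheable prefix length, assuming the documented behavior:
    no caching below 1,024 tokens, then cache hits in 128-token increments."""
    if prompt_tokens < 1024:
        return 0
    return (prompt_tokens // 128) * 128

for n in (900, 1024, 1100, 1500):
    print(n, cached_tokens(n))
# 900 -> 0, 1024 -> 1024, 1100 -> 1024, 1500 -> 1408
```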

1 Like