One thing that’s kind of weird is that OpenAI only applies token caching if your prompt is at least 1024 tokens in length.
Since cached tokens cost half as much as non-cached tokens, any prompt between 513 and 1023 tokens ends up costing more than the same prompt stuffed out to 1024 tokens. (Assuming you’re calling the API frequently enough for the cache to remain in play.)
The closer your prompt is to 1024 tokens, the more you could save by padding it up to that length.
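To make the break-even concrete, here is a rough back-of-the-envelope sketch, assuming the 50% cached-input discount and ignoring the first call that has to write the cache (the per-token price is a placeholder, since only the ratio matters):

```python
# Rough comparison: unpadded prompt (no cache below 1024 tokens)
# vs. the same prompt padded to the 1024-token cache minimum.
PRICE_PER_TOKEN = 1.0   # placeholder unit price; only the ratio matters
CACHE_MINIMUM = 1024
CACHE_DISCOUNT = 0.5    # cached input tokens are billed at half price

def unpadded_cost(prompt_tokens: int) -> float:
    # Below 1024 tokens nothing is cached, so every token is full price.
    return prompt_tokens * PRICE_PER_TOKEN

def padded_cost() -> float:
    # Padded to exactly 1024 tokens, all of which hit the cache on repeat calls.
    return CACHE_MINIMUM * PRICE_PER_TOKEN * CACHE_DISCOUNT

for n in (500, 513, 768, 1023):
    print(n, unpadded_cost(n), padded_cost(), unpadded_cost(n) > padded_cost())
# Anything above 512 tokens is cheaper when padded to 1024, on cache hits.
```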
Other than offering token caching below 1024 tokens, I’m not sure what else OpenAI could do to remove the need for this kind of gaming. In the meantime, has anyone actually done this? Does anyone have recommendations for how best to pad a prompt so it won’t hurt (or might even improve) the completion results?
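For anyone who wants to experiment, here is a minimal sketch of measuring the gap and padding with tiktoken; the o200k_base encoding is my assumption for the newer models, and the filler text is just an arbitrary example, not a recommendation:

```python
import tiktoken

# o200k_base is the encoding used by the gpt-4o family; check for your model.
enc = tiktoken.get_encoding("o200k_base")

def pad_to_cache_minimum(system_prompt: str, minimum: int = 1024) -> str:
    """Append inert filler until the prompt reaches the cache minimum."""
    filler = "\n# Padding to reach the prompt-cache minimum; ignore this section."
    padded = system_prompt
    while len(enc.encode(padded)) < minimum:
        padded += filler
    return padded

prompt = "You are a helpful assistant..."  # your real system prompt here
padded = pad_to_cache_minimum(prompt)
print(len(enc.encode(prompt)), "->", len(enc.encode(padded)), "tokens")
```

Note that the cache applies to the whole prompt prefix (system message, tool definitions, and prior messages), not just the system prompt, and message formatting adds a few tokens of overhead, so treat the count as approximate.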
I’m sure that activating the context cache through special techniques like this would be welcomed, not discouraged.
Anthropic discounts cache reads by 90% (though their caching has to be marked explicitly rather than applied automatically). These are billed tokens that don’t require an AI model to operate on them, only a backing cache system (although the actual computation cost of any particular token would be an impenetrable accounting formula, since it depends on its position in the context).
That doesn’t change the fact that context length itself can be a distraction that hurts quality.
You would also need to anticipate when targeting the cache would be wasteful, based on typical user or application behavior, since retention can be as little as five minutes. You could end up sending large padded inputs that are never reused.
An ideal situation would be a chat session that has grown past the cache’s minimum length through actual user chat turns. Because of the timeout, only continued engagement stays cached. That requires appending new messages only, with no changes to already-placed function definitions or RAG content that sits early in the context.
That also means that padding the system prompt up front (perhaps just repeating the same message, or spending the extra tokens on more quality elsewhere, such as function schema definitions) could not be removed at some later point without breaking the cacheable prefix.
If you were to switch from a padded to an unpadded system prompt, the natural point would be when you are already discarding a large portion of stale user chat to bring API calls back within budget.
Overall, though, I think this is unnecessary: a single AI output will likely blow past 1024 tokens (if not the user input alone), meaning merely a second chat turn would build a cache.
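A minimal sketch of that append-only pattern, assuming the prompt_tokens_details.cached_tokens field the current Python SDK exposes for cache hits (the model name and example prompts are arbitrary):

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def chat_turn(user_text: str) -> str:
    """Append-only turns keep the prompt prefix stable so the cache can apply."""
    messages.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})

    # cached_tokens stays at 0 until the prompt prefix exceeds 1024 tokens
    # and the same prefix was seen within the cache retention window.
    details = response.usage.prompt_tokens_details
    cached = details.cached_tokens if details else 0
    print("prompt tokens:", response.usage.prompt_tokens, "cached:", cached)
    return reply

chat_turn("Summarize the plot of Hamlet in detail.")  # long reply grows the history
chat_turn("Now compare it to Macbeth.")               # second turn may already hit the cache
```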
With a batch of calls, you can target the caching more deliberately: send one call first to warm the cache, and allow a few seconds before the repeated calls begin.
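For example, a priming call followed by parallel calls that share the same long prefix might look like this (the sleep duration, worker count, and placeholder prompt are arbitrary assumptions):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

# The shared prefix itself must exceed 1024 tokens for any of this to matter.
SHARED_PREFIX = [{"role": "system", "content": "...your long shared instructions..."}]

def run(question: str):
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=SHARED_PREFIX + [{"role": "user", "content": question}],
    )

questions = ["q1", "q2", "q3", "q4"]

# Prime the cache with one call, give it a moment to register,
# then fire the remaining calls in parallel against the cached prefix.
run(questions[0])
time.sleep(5)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run, questions[1:]))
```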