I recently came across this ‘Input cached tokens’ field in my Chat Completions API usage details…
What exactly are these cached tokens? Is the API caching the entire input to retain context across requests, or is it caching only the input prompt tokens?
I haven’t read exactly how it works at OpenAI, but I imagine it functions similarly to DeepSeek, which I’ve studied in more detail. When the model digests the INPUT (the so-called prefill stage, which happens after tokenization), it builds up internal state for every input token. That computation doesn’t need to be repeated every time a new user or assistant message is added WITHIN THE SAME CONVERSATION, because the state already computed for the earlier part of the prompt can be preserved: this is what gets cached.
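To make that concrete, here’s a toy Python sketch of the idea (purely illustrative, not how any provider actually implements it): the cache is keyed on token prefixes, so only the unseen suffix of each new request pays the full processing cost.

```python
# Toy model of prefix caching. "expensive_digest" stands in for the costly
# per-token prefill computation a real model would perform.

def expensive_digest(tokens: tuple[str, ...]) -> list[str]:
    """Stand-in for the expensive per-token processing."""
    return [f"state({t})" for t in tokens]

class PrefixCache:
    def __init__(self) -> None:
        self._cache: dict[tuple[str, ...], list[str]] = {}

    def process(self, tokens: tuple[str, ...]) -> list[str]:
        # Find the longest previously-processed prefix of this prompt.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._cache:
                best = n
                break
        cached = self._cache.get(tokens[:best], [])
        # Only the new suffix is processed from scratch.
        fresh = expensive_digest(tokens[best:])
        states = cached + fresh
        self._cache[tokens] = states  # store for the next turn
        print(f"cached tokens: {best}, newly processed: {len(tokens) - best}")
        return states

cache = PrefixCache()
turn1 = ("system:", "be", "helpful.", "user:", "hi")
cache.process(turn1)   # cached tokens: 0, newly processed: 5
turn2 = turn1 + ("assistant:", "hello!", "user:", "next", "question")
cache.process(turn2)   # cached tokens: 5, newly processed: 5
```

Notice that the second turn reuses everything from the first turn, because the conversation history is an exact prefix of the new request.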
As a result, especially in longer conversations, you notice a significant boost in response speed and processing efficiency! So much so that providers charge HALF THE PRICE, or even less, for tokens retrieved from the cache: OpenAI bills cached input tokens at a steep discount, and DeepSeek’s cache-hit rate is cheaper still!
To put it another way, in non-technical terms: these LLM systems avoid re-processing the INPUT from scratch EVERY TIME (the expensive part isn’t tokenization, but the model’s own pass over the tokens). Instead, they check whether the start of what you send in each API call has already been sent before, and retrieve the previously processed portion. This way, the initial part of the INPUT tokens is fetched from cache, while the new tokens at the end are processed for the first time and then added to the cache for the next call.
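You can see this in your own calls: the OpenAI Python SDK reports the cached portion in the `usage` object of the response. A minimal sketch (model name and prompt are placeholders; it assumes `OPENAI_API_KEY` is set, and note that caching only kicks in above a minimum prefix length, around 1024 tokens at the time of writing):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Put the long, STABLE part of the prompt (system instructions, documents)
# first, so it can form a reusable prefix across turns.
messages = [
    {"role": "system", "content": "You are a helpful assistant. " + "..." * 500},
    {"role": "user", "content": "First question about my document."},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)

usage = response.usage
print("input tokens:", usage.prompt_tokens)
# On a cache hit, the reused prefix is reported here and billed at the
# discounted cached-input rate (the field can be absent on a miss):
details = usage.prompt_tokens_details
print("cached input tokens:", details.cached_tokens if details else 0)
```

The ordering matters: if you shuffle the stable content around between calls, the prefix no longer matches and nothing gets served from the cache.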
I, too, was left scratching my head the first time I heard about cached input tokens, hehehe.
I assume that in the future, providers (i.e., OpenAI, DeepSeek, etc.) will standardize pricing by averaging cached and uncached INPUT tokens into a single rate. In fact, many providers are already SIMPLIFYING pricing by flattening the rates for INPUT and OUTPUT tokens. In my opinion, they ought to unify and streamline these calculations unless they want the system to become frustratingly impractical. Honestly, the same goes for ChatGPT’s user interface, which is getting more convoluted by the day instead of simpler. Don’t even get me started on usage limit policies!
Especially the largest providers (and OpenAI is THE largest) should calculate their costs and offer balanced, straightforward pricing and easy-to-understand terms. Frankly, that’s my humble opinion. I don’t know who’s in charge of these “details” at the company, but they seem like total geeks. They need a STRONG marketing and PRODUCT lead to smooth out these issues.