I recently came across this ‘Input cached tokens’ field in my Chat Completions API usage details…
What exactly are these cached tokens? Is the API caching the entire input to retain context across requests, or is it caching only the input prompt tokens?
I haven’t read exactly how it works at OpenAI, but I imagine it functions similarly to DeepSeek, which I’ve studied in more detail. When the model digests the INPUT (the so-called prefill stage, which happens after tokenization), it builds up internal state for every input token. That computation doesn’t need to be repeated every time a new user or assistant message is added WITHIN THE SAME CONVERSATION, because the state already computed for the earlier part of the prompt can be preserved: this is what gets cached.
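To make that concrete, here’s a toy Python sketch of the idea (purely illustrative, not how any provider actually implements it): the cache is keyed on token prefixes, so only the unseen suffix of each new request pays the full processing cost.

```python
# Toy model of prefix caching. "expensive_digest" stands in for the costly
# per-token prefill computation a real model would perform.

def expensive_digest(tokens: tuple[str, ...]) -> list[str]:
    """Stand-in for the expensive per-token processing."""
    return [f"state({t})" for t in tokens]

class PrefixCache:
    def __init__(self) -> None:
        self._cache: dict[tuple[str, ...], list[str]] = {}

    def process(self, tokens: tuple[str, ...]) -> list[str]:
        # Find the longest previously-processed prefix of this prompt.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._cache:
                best = n
                break
        cached = self._cache.get(tokens[:best], [])
        # Only the new suffix is processed from scratch.
        fresh = expensive_digest(tokens[best:])
        states = cached + fresh
        self._cache[tokens] = states  # store for the next turn
        print(f"cached tokens: {best}, newly processed: {len(tokens) - best}")
        return states

cache = PrefixCache()
turn1 = ("system:", "be", "helpful.", "user:", "hi")
cache.process(turn1)   # cached tokens: 0, newly processed: 5
turn2 = turn1 + ("assistant:", "hello!", "user:", "next", "question")
cache.process(turn2)   # cached tokens: 5, newly processed: 5
```

Notice that the second turn reuses everything from the first turn, because the conversation history is an exact prefix of the new request.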
As a result, especially in longer conversations, you notice a significant boost in response speed and processing efficiency! So much so that providers charge HALF THE PRICE, or even less, for tokens retrieved from the cache: OpenAI bills cached input tokens at a steep discount, and DeepSeek’s cache-hit rate is cheaper still!
To put it another way, in non-technical terms: these LLM systems avoid re-processing the INPUT from scratch EVERY TIME (the expensive part isn’t tokenization, but the model’s own pass over the tokens). Instead, they check whether the start of what you send in each API call has already been sent before, and retrieve the previously processed portion. This way, the initial part of the INPUT tokens is fetched from cache, while the new tokens at the end are processed for the first time and then added to the cache for the next call.
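You can see this in your own calls: the OpenAI Python SDK reports the cached portion in the `usage` object of the response. A minimal sketch (model name and prompt are placeholders; it assumes `OPENAI_API_KEY` is set, and note that caching only kicks in above a minimum prefix length, around 1024 tokens at the time of writing):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Put the long, STABLE part of the prompt (system instructions, documents)
# first, so it can form a reusable prefix across turns.
messages = [
    {"role": "system", "content": "You are a helpful assistant. " + "..." * 500},
    {"role": "user", "content": "First question about my document."},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)

usage = response.usage
print("input tokens:", usage.prompt_tokens)
# On a cache hit, the reused prefix is reported here and billed at the
# discounted cached-input rate (the field can be absent on a miss):
details = usage.prompt_tokens_details
print("cached input tokens:", details.cached_tokens if details else 0)
```

The ordering matters: if you shuffle the stable content around between calls, the prefix no longer matches and nothing gets served from the cache.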
I, too, was left scratching my head the first time I heard about cached input tokens, hehehe.
I assume that in the future, providers (i.e., OpenAI, DeepSeek, etc.) will standardize pricing by averaging cached and uncached INPUT tokens into a single rate. In fact, many providers are already SIMPLIFYING pricing by flattening the rates for INPUT and OUTPUT tokens. In my opinion, they ought to unify and streamline these calculations unless they want the system to become frustratingly impractical. Honestly, the same goes for ChatGPT’s user interface, which is getting more convoluted by the day instead of simpler. Don’t even get me started on usage limit policies!
Especially the largest providers (and OpenAI is THE largest) should calculate their costs and offer balanced, straightforward pricing and easy-to-understand terms. Frankly, that’s my humble opinion. I don’t know who’s in charge of these “details” at the company, but they seem like total geeks. They need a STRONG marketing and PRODUCT lead to smooth out these issues.