Does prompt caching have anything to do with the development of the AI model itself, or is it for profitability purposes that o3-mini isn’t being offered with caching?
As far as I know from the chatbots, there are basic RAG or KV-caching techniques that could give these chat models some form of caching. OpenAI already has the technology established for the other models.
So at this cash-burning rate I’m questioning my choices.
400K input tokens, 15K output tokens => 50 cents already, in just 10 messages back and forth…
It piles up quickly, you see…
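For what it’s worth, that figure is consistent with uncached pricing. A quick sanity check, assuming o3-mini’s uncached rates of $1.10 per 1M input tokens and $4.40 per 1M output tokens (worth re-checking against the current pricing page):

```python
# Back-of-the-envelope check of the "~50 cents for 10 messages" figure,
# assuming o3-mini uncached rates: $1.10 / 1M input, $4.40 / 1M output.
input_tokens = 400_000
output_tokens = 15_000

cost = input_tokens / 1_000_000 * 1.10 + output_tokens / 1_000_000 * 4.40
print(f"~${cost:.2f}")  # ~$0.51, i.e. roughly 50 cents with no cache discount
```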
Whoops, I apologize for the new topic. You’re right. It’s probably that the Roo Code extension doesn’t follow the cached pricing structure and reported the uncached price per usage. Sorry, my bad.
There is another aspect to consider: you will only receive a cache discount on the initial input context when all of the following hold (a way to verify this from the API response is sketched after the list):

- the input messages are sent verbatim and match a previous call exactly from the very start (prefix match);
- the request is routed to the same server that holds the cache, which is prioritized but not guaranteed;
- the identical portion is at least 1024 tokens and is tokenized in the same manner;
- the cache has already been established (not by parallel calls racing each other) and has not expired (roughly 5–60 minutes).
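You can also check whether the discount actually applied instead of relying on a third-party extension’s math: the usage block of the response reports how many prompt tokens were served from the cache. A minimal sketch, assuming the openai Python SDK (v1.x) and the `usage.prompt_tokens_details.cached_tokens` field documented for prompt caching; the placeholder messages and model name are mine, not from this thread:

```python
# Minimal sketch: check how many prompt tokens were served from the cache.
# Assumes the openai Python SDK (v1.x); field names may differ by version.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": "…your long, stable system prompt (>=1024 tokens)…"},
    {"role": "user", "content": "…the latest user turn…"},
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; swap in whichever cache-eligible model you test
    messages=messages,
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens
print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
# Expect 0 on the first call; on an identical follow-up inside the cache
# window, most of the shared prefix should show up here if the discount hit.
```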
There is also the share of the bill that goes to unrepeatable reasoning output: it can be far higher than the cost of the input you actually resend, and it keeps growing even in cases where a cache discount can be activated. For example, here is 48 tokens “in” and 3,000 tokens “out” from o3-medium:
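To put those token counts in perspective, a hedged back-of-the-envelope sketch (the per-token rates are my assumptions based on o3-mini-style uncached pricing, not figures taken from the usage readout):

```python
# Why caching barely matters for this call: 48 prompt tokens vs 3,000
# reasoning/output tokens. Rates below are assumed uncached prices
# ($1.10 / 1M input, $4.40 / 1M output); check the current pricing page.
in_tok, out_tok = 48, 3_000

in_cost = in_tok / 1_000_000 * 1.10    # ~$0.00005
out_cost = out_tok / 1_000_000 * 4.40  # ~$0.0132

# Even a full cache hit would only halve the input cost -- and a 48-token
# prompt is below the ~1024-token minimum, so no discount applies at all.
best_case_saving = in_cost * 0.5

print(f"input ${in_cost:.6f}  output ${out_cost:.6f}  "
      f"best-case cache saving ${best_case_saving:.6f}")
```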