Does anyone know how KV caching works with OpenAI API calls? I assume that when you use ChatGPT they save a ton of compute by caching KV matrices, but when using the API it's less clear what they do on the backend.
My hunch is that if you use threads with the API, they handle this automatically to reduce their costs / your latency, but if you don't use threads they treat each API call as an entirely new query. Mainly hoping to confirm that hypothesis here.
To give more context on why I care: if I alternate sending prompts between GPT-4 and Claude Haiku, then as the context grows, does OpenAI automatically stop trying to store a KV cache? Would my costs and latency go up? A rough sketch of the alternating pattern I mean is below.
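(The model names and the shared history are just my setup, not anything official; this is only meant to show the call pattern, assuming the standard openai and anthropic Python SDKs.)

```python
# Rough sketch of the alternating pattern I'm asking about.
# Model names are placeholders; both providers see the same growing history.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

history = []  # grows every turn, shared across both providers

def ask(user_text: str, turn: int) -> str:
    history.append({"role": "user", "content": user_text})
    if turn % 2 == 0:
        # Even turns go to OpenAI -- does their backend still keep a KV cache
        # for this conversation when every other turn is handled elsewhere?
        resp = openai_client.chat.completions.create(
            model="gpt-4", messages=history
        )
        reply = resp.choices[0].message.content
    else:
        # Odd turns go to Anthropic with the same growing history.
        resp = anthropic_client.messages.create(
            model="claude-3-haiku-20240307", max_tokens=512, messages=history
        )
        reply = resp.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```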
Each API call is its own entity, even when initiated by Assistants, which still uses the same API models, just with their framework code instead of yours loading the context and catching and returning tool calls without external interaction.
There is certainly an opportunity to precompute states. Why should OpenAI recompute "You are ChatGPT" a million times an hour?
API calls don't give much surface area where the overhead of storage and retrieval could be worth it. They start anew at the instructions each time, and the attention states diverge immediately depending on whether my first input is "from apple documentation…" or "from banana documentation". A toy sketch of what I mean is below.
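(Toy illustration only, nothing to do with OpenAI's actual serving stack: a prefix cache can only skip recomputation for the longest shared token prefix, so everything after the first differing token has to be recomputed anyway.)

```python
# Toy illustration: prefix reuse stops at the first token that differs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def shared_prefix_len(prompt_a: str, prompt_b: str) -> int:
    a, b = enc.encode(prompt_a), enc.encode(prompt_b)
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "You are ChatGPT, a large language model trained by OpenAI.\n"
p1 = system + "from apple documentation: how do I open a window?"
p2 = system + "from banana documentation: how do I open a window?"

reusable = shared_prefix_len(p1, p2)
total = len(enc.encode(p2))
print(f"KV states reusable for {reusable} of {total} tokens; the rest must be recomputed.")
```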
One can speculate, or even probe with timing attacks, but the API models are a black box of secrets where language simply comes out to fulfill your input.
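If you did want to probe, the crude idea would be something like this (purely a sketch: the model name, trial counts, and the assumption that a cache hit would show up as lower time-to-first-token are all mine, and noise, batching, and routing could easily swamp any signal):

```python
# Crude timing probe: send the same long prompt repeatedly and compare
# time-to-first-token against prompts whose very first words differ.
# If a server-side prefix cache exists, the repeats *might* stream faster.
import time
from openai import OpenAI

client = OpenAI()
LONG_FILLER = "lorem ipsum " * 500  # long shared tail so any prefix reuse matters

def time_to_first_token(prompt: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        stream=True,
    )
    for _ in stream:  # first chunk arriving is our proxy for prefill time
        break
    return time.perf_counter() - start

repeat = [time_to_first_token("from apple documentation: " + LONG_FILLER) for _ in range(5)]
fresh = [time_to_first_token(f"from banana documentation {i}: " + LONG_FILLER) for i in range(5)]
print("identical prompts:", [round(t, 2) for t in repeat])
print("varying prompts:  ", [round(t, 2) for t in fresh])
```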
Yeah, I figure only an OpenAI employee can answer this one. Their docs mention that some prompts are more KV-cache efficient, so they definitely do something, but ultimately it would take someone on the inside to confirm.