In a dialogue with gpt-4o using the OpenAI API it would be cost and latency advantageous to cache a system prompt that would incorporate a large document that would be advised by the llm each time it would respond to a question . The caching would take place on OpenAI side and would turn the interaction to statefull. Any ideas?
you can do this with assistants api. once you setup the system prompt(instructions), knowledge files (vector store) , added tools, created thread, afterwards you basically just send one message each time you interact with it. but of course, you still pay for all the tokens used. but you are only sending one message.
All of these services use similar libraries to implement the transformer used in their models. We’re starting to see other services like Gemini and Claude add caching support so it’s likely just a matter of time before OpenAI adds a similar feature.
UPDATE:
Right after responding to this I stumbled upon interview with Jeremy Howard (Answer.ai and Fast.ai) and he was discussing the same topic. Jeremy suspects we’re going to start seeing all of the model providers add some form of KV Caching. It’s just too big of a perf boost to not offer KV Caching in some form.