Reducing cost of Realtime API by caching tool responses

Just sharing one idea for how to potentially reduce the cost of using the Realtime API. If you have tool calls that generate a limited set of responses, you can record the generated audio the first time you see a given response and then play back that recording for future occurrences.

Basically, if you have a bunch of users all asking “what’s the current weather in Seattle?”, why pay to generate the same basic response over and over again?

This obviously won’t work for personalized responses, but I think it could work well for tool calls where the assistant is asking you to take over and do something anyway. You would just need to patch the conversation history in addition to playing back the cached recording.

You can’t inject audio into the conversation history, but you can inject text transcripts, so you would want to cache the transcript for the recording in addition to the audio snippet.
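Here’s a rough sketch of the flow I have in mind. The three helpers are placeholders for whatever your app already does at those steps (none of them are real Realtime API calls), so treat this as pseudocode for the caching logic only:

```python
def synthesize_response(result_text: str) -> tuple[bytes, str]:
    """Placeholder: generate audio + transcript for a tool result via the Realtime API."""
    raise NotImplementedError

def play_audio(audio: bytes) -> None:
    """Placeholder: stream a recording back to the caller."""
    raise NotImplementedError

def inject_transcript(transcript: str) -> None:
    """Placeholder: patch the conversation history with the text transcript."""
    raise NotImplementedError

# Tool response text -> (audio, transcript)
audio_cache: dict[str, tuple[bytes, str]] = {}

def respond_to_tool_result(result_text: str) -> None:
    cached = audio_cache.get(result_text)
    if cached is None:
        # First time we see this response: pay to generate it, then cache it.
        cached = synthesize_response(result_text)
        audio_cache[result_text] = cached
    audio, transcript = cached
    play_audio(audio)              # play the (possibly cached) recording
    inject_transcript(transcript)  # keep the history consistent with what was heard
```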


This is a great idea, and there is already some support for this built into LangChain’s LLM response caching.

Actually, it’s not only great from a performance PoV but also from a consistency PoV - for example, if you are serving a group of people all in the same organization, it’s nice that for identical instructions and context/data they receive an identical response; otherwise it may lead to confusion.

The trickiest part is sorting out the “freshness” of the response cache, so you need some kind of time-to-live (TTL) period. This can be very domain-specific: if you are asking for a weather forecast, you want it refreshed fairly frequently, while a fact-based instruction like “what are the features of the iPhone 15” can have a nearly infinite expiry period.
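As a rough sketch of what I mean by domain-specific TTLs (the tool names and TTL values below are made up for illustration):

```python
import time

# Per-tool TTLs; None means the entry effectively never expires.
TTL_SECONDS = {
    "get_weather": 15 * 60,        # refresh weather fairly frequently
    "get_product_features": None,  # static facts can live nearly forever
}

# cache key -> (created_at, audio, transcript)
cache: dict[str, tuple[float, bytes, str]] = {}

def get_cached(tool_name: str, key: str) -> tuple[bytes, str] | None:
    entry = cache.get(key)
    if entry is None:
        return None
    created_at, audio, transcript = entry
    ttl = TTL_SECONDS.get(tool_name)
    if ttl is not None and time.time() - created_at > ttl:
        del cache[key]  # stale entry: force a fresh generation next time
        return None
    return audio, transcript
```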

I was thinking you’d cache based on the response answer, not necessarily the question. So “sunny and 78 degrees” would always hash to the same cache entry, and you don’t really have a freshness issue.
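Something like this for the key (just a sketch):

```python
import hashlib

def cache_key(response_text: str) -> str:
    # Key the cache off the response text itself, so identical tool outputs
    # reuse the same recording no matter how the question was phrased.
    normalized = " ".join(response_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# cache_key("Sunny and 78 degrees") == cache_key("sunny and  78 degrees")
```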

It still very much depends on your specific domain.