Help me understand the realtime usage block

Cached input is discounted relative to uncached input: text input tokens that hit the cache cost 50% less, and audio input tokens that hit the cache cost 80% less.
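To make the arithmetic concrete, here is a rough sketch of how you might apply those discounts to the token counts from a usage block. The function name, its parameters, and the per-token prices are all hypothetical placeholders, not official values. It also assumes the reported text and audio input totals include the cached tokens, so check that against the usage block you actually receive.

```python
# Sketch only: applies the 50% (text) and 80% (audio) cache discounts
# to input token counts. Names and prices here are illustrative assumptions.

CACHED_TEXT_MULTIPLIER = 0.5   # cached text input costs 50% less
CACHED_AUDIO_MULTIPLIER = 0.2  # cached audio input costs 80% less


def estimate_input_cost(
    text_tokens: int,
    audio_tokens: int,
    cached_text_tokens: int,
    cached_audio_tokens: int,
    text_price_per_token: float,
    audio_price_per_token: float,
) -> float:
    """Estimate input cost, assuming the totals include the cached tokens."""
    if cached_text_tokens > text_tokens or cached_audio_tokens > audio_tokens:
        raise ValueError("cached tokens cannot exceed the total for that modality")

    # Tokens that missed the cache are billed at the full rate.
    uncached_text = text_tokens - cached_text_tokens
    uncached_audio = audio_tokens - cached_audio_tokens

    text_cost = (
        uncached_text * text_price_per_token
        + cached_text_tokens * text_price_per_token * CACHED_TEXT_MULTIPLIER
    )
    audio_cost = (
        uncached_audio * audio_price_per_token
        + cached_audio_tokens * audio_price_per_token * CACHED_AUDIO_MULTIPLIER
    )
    return text_cost + audio_cost


if __name__ == "__main__":
    # Placeholder prices; substitute the current rates for your model.
    cost = estimate_input_cost(
        text_tokens=1000,
        audio_tokens=500,
        cached_text_tokens=400,
        cached_audio_tokens=250,
        text_price_per_token=1.0,
        audio_price_per_token=10.0,
    )
    print(f"estimated input cost: {cost}")
```

Output tokens are not covered here, since the discounts above apply only to input that hits the cache.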

Here is the announcement regarding prompt caching on the Realtime API:
