Help me understand the realtime usage block

I’m getting this usage block from the WebRTC realtime API and I want to calculate total cost based on it:

{
  "total_tokens": 821,
  "input_tokens": 789,
  "output_tokens": 32,
  "input_token_details": {
    "text_tokens": 313,
    "audio_tokens": 476,
    "cached_tokens": 640,
    "cached_tokens_details": {
      "text_tokens": 256,
      "audio_tokens": 384
    }
  },
  "output_token_details": {
    "text_tokens": 9,
    "audio_tokens": 23
  }
}

I’m confused by the cached tokens. If I have 313 input text_tokens and 256 cached text_tokens, does that mean I should calculate the cost of 313 - 256 = 57 text tokens ($2.50/million) and then add on the cost of the 256 cached tokens?

The price of cached tokens for the audio preview API isn’t listed on https://openai.com/api/pricing/

The blog entry https://openai.com/index/o1-and-new-tools-for-developers/ says “Cached audio input costs are reduced by 87.5% to $2.50/1M input tokens” but doesn’t say anything about text tokens. BUT for the new GPT-4o mini audio preview API it says “Cached audio and text both cost $0.30/1M tokens” - does that mean that for GPT-4o audio preview cached text tokens cost the same as cached audio tokens?


To the best of my knowledge, cached tokens are charged at 50% of the normal rate, so yes: subtract the cached tokens from the total, bill the remainder at the normal rate, and bill the cached tokens at half price to get an accurate cost.


The cached-token pricing is listed under the Realtime API section (just under Fine-tuning models) on that page; you have to scroll a little further down (I don’t know why it’s that far down the page lol).

One thing to note is that the gpt-4o-audio-preview and gpt-4o-mini-audio-preview models are available in the Chat Completions API and differ from the Realtime API models.

As for your usage calculation, this is what the pricing page says for gpt-4o-realtime-preview-2024-12-17 (which is the new realtime snapshot released just yesterday):

Text
$5.00 / 1M input tokens
$2.50 / 1M cached* input tokens
$20.00 / 1M output tokens

Audio
$40.00 / 1M input tokens
$2.50 / 1M cached* input tokens
$80.00 / 1M output tokens

Based on my understanding of the pricing, for your example it works like this:

# Input text tokens
total: 313 tokens
--> cached: 256 tokens (billed $2.50 / 1M)
--> normal: 313 - 256 = 57 tokens (billed $5.00 / 1M)

# Input audio tokens
total: 476 tokens
--> cached: 384 tokens (billed $2.50 / 1M)
--> normal: 476 - 384 = 92 tokens (billed $40.00 / 1M)

It looks like the text tokens haven’t changed in terms of pricing, but the audio has indeed received that 87.5% reduction for cached tokens.
The previous realtime model snapshot gpt-4o-realtime-preview-2024-10-01 costs $20.00 / 1M for cached audio tokens, but the new gpt-4o-realtime-preview-2024-12-17 costs $2.50 / 1M. This also means the pricing for cached audio and text tokens is the same in this snapshot (both $2.50 / 1M according to the pricing page).
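Putting the breakdown above together with the output-token prices, here is a minimal Python sketch of the whole calculation for the usage block in the question. The prices are the ones quoted from the pricing page for gpt-4o-realtime-preview-2024-12-17; the dict keys and function name are just illustrative, not anything from the API.

```python
# Per-1M-token prices quoted above for gpt-4o-realtime-preview-2024-12-17
# (illustrative names; verify against the current pricing page).
PRICES = {
    "text_input": 5.00,
    "text_cached": 2.50,
    "audio_input": 40.00,
    "audio_cached": 2.50,
    "text_output": 20.00,
    "audio_output": 80.00,
}

# The usage block from the question, trimmed to the fields we need.
usage = {
    "input_token_details": {
        "text_tokens": 313,
        "audio_tokens": 476,
        "cached_tokens_details": {"text_tokens": 256, "audio_tokens": 384},
    },
    "output_token_details": {"text_tokens": 9, "audio_tokens": 23},
}

def cost_usd(usage, prices):
    inp = usage["input_token_details"]
    cached = inp["cached_tokens_details"]
    out = usage["output_token_details"]
    total = (
        # Uncached input tokens = total minus cached, billed at the normal rate.
        (inp["text_tokens"] - cached["text_tokens"]) * prices["text_input"]
        + cached["text_tokens"] * prices["text_cached"]
        + (inp["audio_tokens"] - cached["audio_tokens"]) * prices["audio_input"]
        + cached["audio_tokens"] * prices["audio_cached"]
        # Output tokens have no cached tier.
        + out["text_tokens"] * prices["text_output"]
        + out["audio_tokens"] * prices["audio_output"]
    )
    return total / 1_000_000

print(f"${cost_usd(usage, PRICES):.6f}")  # → $0.007585
```

So this single response works out to a fraction of a cent; the cached audio tokens being billed at $2.50 instead of $40.00 is what keeps it that low.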

Here’s what I ended up implementing. I’m not 100% confident I’ve got the calculations right though:

https://tools.simonwillison.net/openai-webrtc

Source code here: tools/openai-webrtc.html at c9f3085107fd1177329846de95c840eda64b1748 · simonw/tools · GitHub

Text input that hits the cache costs 50% less ($2.50/1M vs $5.00/1M). Audio input that hits the cache is discounted far more steeply ($2.50/1M vs $40.00/1M).

Here is the announcement regarding prompt caching on the Realtime API:
