I’m confused by the cached tokens. If I have 313 input text_tokens and 256 cached text_tokens does that mean I need to calculate the cost of 313-256 = 57 text tokens ($2.50/million) and then add on the cost of 256 cached tokens?
The blog entry https://openai.com/index/o1-and-new-tools-for-developers/ says “Cached audio input costs are reduced by 87.5% to $2.50/1M input tokens” but doesn’t say anything about text tokens. BUT for the new GPT-4o mini audio preview API it says “Cached audio and text both cost $0.30/1M tokens” - does that mean that for GPT-4o audio preview cached text tokens cost the same as cached audio tokens?
To the best of my knowledge, cached text tokens are charged at 50% of the normal rate, so yes: subtract the cached tokens from the total, bill the remaining tokens at the full rate, and add the cached tokens at half price to get an accurate cost.
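As a quick sanity check, here is that calculation for the usage in the question. The rates are assumptions: $2.50/1M for uncached input text tokens and a 50% cached discount ($1.25/1M), as described above; check the pricing page for your exact model snapshot.

```python
# Assumed rates (dollars per token) -- verify against the pricing page.
INPUT_RATE = 2.50 / 1_000_000   # uncached input text tokens
CACHED_RATE = 1.25 / 1_000_000  # cached input text tokens (assumed 50% discount)

input_tokens = 313   # total input text tokens from the usage object
cached_tokens = 256  # of which were cache hits

uncached = input_tokens - cached_tokens  # 57 tokens billed at the full rate
cost = uncached * INPUT_RATE + cached_tokens * CACHED_RATE
print(f"uncached={uncached}, cost=${cost:.8f}")
```

With those assumed rates this works out to 57 × $2.50/1M + 256 × $1.25/1M = $0.0004625 for the request's text input.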
The cached token pricing is listed under the Realtime API section (just under Fine-tuning models) on that page; you have to scroll down a bit further (I don’t know why it’s that low on the page lol).
One thing to note is that the gpt-4o-audio-preview and gpt-4o-mini-audio-preview models are available in the Chat Completions API and differ from the Realtime API models.
As for your usage calculation, this is what the pricing page says for gpt-4o-realtime-preview-2024-12-17 (which is the new realtime snapshot released just yesterday):
It looks like the text tokens haven’t changed in terms of pricing, but the audio has indeed received that 87.5% reduction for cached tokens.
The previous realtime model snapshot, gpt-4o-realtime-preview-2024-10-01, costs $20/1M for cached audio tokens, but the new gpt-4o-realtime-preview-2024-12-17 costs $2.50/1M. This also means cached audio and cached text tokens are priced the same in this snapshot (both $2.50/1M according to the pricing page).
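That drop from $20/1M to $2.50/1M also lines up with the 87.5% figure quoted from the blog post. A one-line check, using the per-1M rates mentioned above:

```python
# Cached audio input rates ($ per 1M tokens) for the two realtime snapshots,
# as quoted from the pricing page above.
OLD_CACHED_AUDIO = 20.00  # gpt-4o-realtime-preview-2024-10-01
NEW_CACHED_AUDIO = 2.50   # gpt-4o-realtime-preview-2024-12-17

reduction = 1 - NEW_CACHED_AUDIO / OLD_CACHED_AUDIO
print(f"reduction: {reduction:.1%}")  # 1 - 2.5/20 = 87.5%
```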