Confusion Between Per-Minute Audio Pricing vs. Token-Based Audio Pricing

Hey everyone, I’m trying to figure out the cost for an hour-long audio conversation with GPT. I see two different pricing structures mentioned out there and I’m not sure if I’m mixing them up or if they’re supposed to be the same thing.


Old Per-Minute Pricing (What I Read)

  • $0.06 per minute for audio input
  • $0.24 per minute for audio output

For one hour of conversation where 30 minutes are me talking (input) and 30 minutes are the model replying (output), I’d expect:

  • Input cost: 30 minutes × $0.06/min = $1.80
  • Output cost: 30 minutes × $0.24/min = $7.20
  • Total: $9.00

That part seems straightforward enough.
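Just to sanity-check my arithmetic, here is the per-minute estimate as a small Python sketch (the $0.06/min and $0.24/min rates are the ones I quoted above):

```python
# Per-minute estimate: 30 min of user audio in, 30 min of model audio out,
# at the quoted $0.06/min (input) and $0.24/min (output) rates.
INPUT_RATE = 0.06   # $ per minute of audio input
OUTPUT_RATE = 0.24  # $ per minute of audio output

input_cost = 30 * INPUT_RATE    # 30 minutes of me talking
output_cost = 30 * OUTPUT_RATE  # 30 minutes of the model replying
total = input_cost + output_cost
print(f"input=${input_cost:.2f} output=${output_cost:.2f} total=${total:.2f}")
```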


Token-Based Pricing (What I Also Read)

I also came across a mention of audio costing something like:

  • $40 / 1M tokens for audio input
  • $80 / 1M tokens for audio output

So I tried to compare that to my 30-min-in / 30-min-out scenario. If I assume 140 words per minute of speech, that means:

30 minutes × 140 words/min = 4,200 words

If I do a rough conversion of ~1 word = 1.33 tokens, that’s:

4,200 words × 1.33 tokens/word ≈ 5,600 tokens

  • Audio input tokens: 5,600

    • In millions (M): 0.0056M
    • Cost: 0.0056 × $40 = $0.224
  • Audio output tokens: 5,600

    • In millions (M): 0.0056M
    • Cost: 0.0056 × $80 = $0.448
  • Total token-based cost: $0.224 + $0.448 = $0.672

Comparing $0.672 to the $9.00 from the per-minute method is a huge difference, so I’m not sure if I’m missing something big here. It just seems too cheap when I switch to counting tokens!
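Here is the same token arithmetic as a sketch. Note it bakes in my assumption that audio converts to tokens like text (140 words/min, ~1.33 tokens/word), which may be exactly where I'm going wrong:

```python
# Token-based estimate under a text-token heuristic. The 140 wpm and
# 1.33 tokens/word figures are rough assumptions -- they may not reflect
# how audio is actually tokenized.
WPM = 140
TOKENS_PER_WORD = 1.33
PRICE_IN = 40 / 1_000_000   # $ per audio input token (quoted above)
PRICE_OUT = 80 / 1_000_000  # $ per audio output token (quoted above)

tokens = 30 * WPM * TOKENS_PER_WORD  # ~5,586 tokens per 30-minute side
cost_in = tokens * PRICE_IN
cost_out = tokens * PRICE_OUT
total = cost_in + cost_out
print(f"tokens~{tokens:.0f} in=${cost_in:.3f} out=${cost_out:.3f} total=${total:.3f}")
```

That lands at roughly the same ~$0.67 total as my hand calculation, so the gap versus $9.00 is real under these assumptions.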


My Questions

  1. If text “hey” is counted as 1 text token, is the audio of “hey” also counted as 1 audio token?
  2. Is the “$40 / 1M tokens” approach really meant for the exact same type of audio input that was previously priced at $0.06/min?
  3. Am I undercounting tokens or forgetting to include something (like prompts, system instructions, or conversation overhead)?
  4. Is the old per-minute rate simply an older or different approach that no longer applies?

Thanks in advance.


Hi,

The Realtime API (like any OpenAI API) was, and is, always priced by tokens used. The only difference with the “old” pricing is that OpenAI published an estimate of what the token pricing would come down to per minute, but the charge was always by tokens.

However, that estimate was very inaccurate, which I think is why they removed it with the new pricing.

So, let’s break down some of your other questions:

  1. A one-hour conversation is not possible; at the moment I think the limit is 30 minutes (or 15), it’s somewhere in the docs.

  2. From my testing of a “natural” phone conversation over 2 minutes, the cost is about $0.09 USD for 4o-mini and around $0.21–$0.25 USD for the 4o realtime model.

BUT don’t forget that with every input, the whole conversation is sent back to the model, so input costs grow rapidly as the conversation gets longer. Most of it should be covered by cache hits, but still.
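To illustrate why re-sending the history adds up, here is a toy model. The 600-tokens-per-turn figure is an invented placeholder, not an official number:

```python
# Toy model: each turn adds ~600 audio tokens of new speech (placeholder
# figure). Because the full history is re-sent as input on every turn,
# cumulative input tokens grow quadratically with the number of turns.
# Cache hits reduce the *cost* of repeated tokens, not their count.
TOKENS_PER_TURN = 600  # assumption, not an official figure

history = 0
total_input = 0
for turn in range(1, 11):  # 10 turns
    history += TOKENS_PER_TURN
    total_input += history  # whole conversation sent each turn
print(total_input)  # 600 * (1 + 2 + ... + 10) = 33000
```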

Another cost factor is interrupting the model. All response tokens are generated while you start listening to the response, so if a large chunk of a response was generated but you cut it off at the first word, you still pay for those unused but generated output tokens.

The best way to evaluate a specific use case is to go to the Playground, simulate a few conversations, and then check usage/billing to see how many tokens were used, how many were cached, and what the conversations are costing you. Look at the logs for the session ID and compare with the detailed usage export.

Quick Example:

User: This is a session.
Assistant: Hi there! What’s on your mind today?

Was:
Audio Token Input: 12
Text Token Input: 759

Audio Token Out: 41
Text Token Out: 19

Hope that helps


Thank you so much! I’ll run my use-case in the playground to see the estimated billing for a 30 minute conversation. Thank you!

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.