Hey everyone, I’m trying to work out what an hour-long audio conversation with GPT would cost. I’ve seen two different pricing structures mentioned, and I’m not sure whether I’m mixing them up or whether they’re meant to describe the same thing.
Old Per-Minute Pricing (What I Read)
- $0.06 per minute for audio input
- $0.24 per minute for audio output
For one hour of conversation where 30 minutes are me talking (input) and 30 minutes are the model replying (output), I’d expect:
- Input cost: 30 minutes × $0.06/min = $1.80
- Output cost: 30 minutes × $0.24/min = $7.20
- Total: $9.00
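Here is a minimal sketch of that per-minute arithmetic, so the numbers can be re-run with a different split between talking and listening. The rates are just the ones quoted in this post (I have not verified them against an official price list), and the function name `per_minute_cost` is made up for illustration.

```python
# Per-minute rates as quoted in this post (assumed, not verified).
INPUT_RATE_PER_MIN = 0.06   # USD per minute of audio input
OUTPUT_RATE_PER_MIN = 0.24  # USD per minute of audio output


def per_minute_cost(input_minutes, output_minutes):
    """Return (input_cost, output_cost, total) in USD for the given minutes."""
    input_cost = input_minutes * INPUT_RATE_PER_MIN
    output_cost = output_minutes * OUTPUT_RATE_PER_MIN
    return input_cost, output_cost, input_cost + output_cost


# The 30-minutes-in / 30-minutes-out scenario from above.
input_cost, output_cost, total = per_minute_cost(30, 30)
print(f"Input: ${input_cost:.2f}  Output: ${output_cost:.2f}  Total: ${total:.2f}")
```

Changing the two arguments shows how sensitive the total is to the output share, since output is priced at four times the input rate.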
That part seems straightforward enough.
Token-Based Pricing (What I Also Read)
I also came across a mention of audio costing something like:
- $40 / 1M tokens for audio input
- $80 / 1M tokens for audio output
So I tried to compare that to my 30-min-in / 30-min-out scenario. If I assume 140 words per minute of speech, that means:
30 minutes × 140 words/min = 4,200 words
If I do a rough conversion of ~1 word = 1.33 tokens, that’s:
4,200 words × 1.33 tokens/word ≈ 5,600 tokens
- Audio input tokens: 5,600
  - In millions (M): 0.0056M
  - Cost: 0.0056 × $40 = $0.224
- Audio output tokens: 5,600
  - In millions (M): 0.0056M
  - Cost: 0.0056 × $80 = $0.448
- Total token-based cost: $0.224 + $0.448 = $0.672
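The same estimate as a small sketch, in case anyone wants to plug in other assumptions. Everything here is taken from the post: 140 words per minute, 1.33 tokens per word, and the $40 / $80 per 1M token prices. The big assumption is that spoken audio tokenizes like the equivalent text, which may well be wrong. The helper name `token_cost` is hypothetical. Without rounding to 5,600 the token count comes out at about 5,586 per side, so the total lands slightly under the $0.672 figure.

```python
# Assumptions from the post (none verified against official documentation).
WORDS_PER_MINUTE = 140      # assumed speaking rate
TOKENS_PER_WORD = 1.33      # text-style conversion; may not apply to audio
INPUT_PRICE_PER_M = 40.0    # USD per 1M audio input tokens, as quoted
OUTPUT_PRICE_PER_M = 80.0   # USD per 1M audio output tokens, as quoted


def token_cost(minutes, price_per_million):
    """Return (estimated_tokens, cost_in_usd) for the given minutes of speech."""
    tokens = minutes * WORDS_PER_MINUTE * TOKENS_PER_WORD
    cost = tokens / 1_000_000 * price_per_million
    return tokens, cost


in_tokens, in_cost = token_cost(30, INPUT_PRICE_PER_M)
out_tokens, out_cost = token_cost(30, OUTPUT_PRICE_PER_M)
token_total = in_cost + out_cost

print(f"Input:  {in_tokens:,.0f} tokens, ${in_cost:.3f}")
print(f"Output: {out_tokens:,.0f} tokens, ${out_cost:.3f}")
print(f"Total:  ${token_total:.3f}")
```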
Comparing $0.672 to the $9.00 from the per-minute method is a huge difference, so I’m not sure if I’m missing something big here. It just seems too cheap when I switch to counting tokens!
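One way to size the gap is to turn the question around: if both price lists were describing the same audio, how many tokens per minute would each per-minute rate imply? The sketch below only back-solves from the numbers quoted in this post. It is a consistency check on my own arithmetic, not a statement about how audio is actually tokenized, and `implied_tokens_per_minute` is a made-up helper name.

```python
def implied_tokens_per_minute(rate_per_minute, price_per_million):
    """Tokens per minute at which a per-minute rate equals a per-token price."""
    return rate_per_minute / price_per_million * 1_000_000


# Rates and prices as quoted in this post (assumed, not verified).
input_tpm = implied_tokens_per_minute(0.06, 40.0)
output_tpm = implied_tokens_per_minute(0.24, 80.0)

# The text-style estimate used earlier: 140 words/min at 1.33 tokens/word.
assumed_tpm = 140 * 1.33

print(f"Implied input tokens/min:  {input_tpm:.0f}")
print(f"Implied output tokens/min: {output_tpm:.0f}")
print(f"Text-style estimate:       {assumed_tpm:.1f}")
```

With these figures the implied rates work out to roughly 1,500 and 3,000 tokens per minute, against about 186 in the text-style estimate. So either audio is counted at many more tokens per minute than the equivalent transcript, or the two price lists are not describing the same thing.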
My Questions
- If the text “hey” counts as 1 text token, does the spoken audio of “hey” also count as 1 audio token?
- Is the “$40 / 1M tokens” approach really meant for the exact same type of audio input that was previously priced at $0.06/min?
- Am I undercounting tokens or forgetting to include something (like prompts, system instructions, or conversation overhead)?
- Is the old per-minute rate simply an older or different approach that no longer applies?
Thanks in advance.