It’s $100 per 1M tokens of audio input and $200 per 1M tokens of audio output. Yet the footnote says $0.06 per minute of input audio and $0.24 per minute of output audio. How is it that one of those numbers isn’t $0.12? Is the resolution of the output audio twice the resolution of input audio? Can we control that???
If you look at the sample code it’s the same pcm16 - 24k for both input & output so the output cost should be $0.12 per minute of output according to their pricing given that the input & output audio sizes are the same…
I did the math and using their $0.06 per minute input pricing that works out to 600 tokens per minute of input data. Since the input & output encoding formats are the same, you would expect the same 600 tokens per minute out which should be $0.12, not $0.24.
I suspect it’s a typo due to some last minute pricing change. I just want to verify that.
Before Microsoft I worked on VoIP and call center software… Silence is super easy to filter out, so I would hope we’re not getting billed for silence, and even if we are, the audio stream should be symmetrical. The token rate in should match the token rate out. The pricing says $200 per 1 million tokens out, but the example data rate implies an effective rate of $400 per 1 million tokens out.
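For anyone who wants to check the arithmetic, here is a minimal sketch that uses only the four prices quoted in this thread. The function and variable names are made up for illustration, and the "same token rate in and out" premise is the assumption argued above, not something the pricing page confirms.

```python
# Prices quoted in this thread (USD).
INPUT_PER_1M = 100.0    # per 1M audio input tokens
OUTPUT_PER_1M = 200.0   # per 1M audio output tokens
INPUT_PER_MIN = 0.06    # footnote: per minute of input audio
OUTPUT_PER_MIN = 0.24   # footnote: per minute of output audio


def tokens_per_minute(price_per_min: float, price_per_1m: float) -> float:
    """Token rate implied by a per-minute price and a per-1M-token price."""
    return price_per_min / price_per_1m * 1_000_000


def price_per_1m(price_per_min: float, tokens_per_min: float) -> float:
    """Per-1M-token price implied by a per-minute price and a token rate."""
    return price_per_min / tokens_per_min * 1_000_000


# Token rate implied by the input prices.
input_tpm = tokens_per_minute(INPUT_PER_MIN, INPUT_PER_1M)

# Reading 1 (assumption): output uses the same token rate as input.
expected_output_per_min = input_tpm * OUTPUT_PER_1M / 1_000_000
implied_output_rate = price_per_1m(OUTPUT_PER_MIN, input_tpm)

# Reading 2: the $200 rate is right and output simply uses more tokens per minute.
output_tpm_if_rate_holds = tokens_per_minute(OUTPUT_PER_MIN, OUTPUT_PER_1M)

print(f"input tokens/min:                   {input_tpm:.0f}")
print(f"expected output $/min at same rate: {expected_output_per_min:.2f}")
print(f"implied output $/1M at same rate:   {implied_output_rate:.0f}")
print(f"output tokens/min if $200 holds:    {output_tpm_if_rate_holds:.0f}")
```

Either the output price per token is effectively double what is listed, or output audio is tokenized at twice the rate of input audio. The numbers alone can’t tell you which.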
How so? It’s an uncompressed audio stream. Are you saying the reasoning behind the response is more? They should clarify that… That’s not what they implied…
@anon22939549 you’re correct… They’re billing based on the text tokens generated, not the amount of audio generated. They’re just charging a premium to convert that text to audio…
That’s only partially correct… They return text with every audio response and you can pass in either text or audio. Basically they have a running transcription they’re managing along with the audio. There may be things in the audio output that don’t make it to the transcript, but the transcript is important…
If you want to continue a long-running conversation you have to persist the transcript and then seed the conversation with the text transcript in some future session. They don’t let the client seed the conversation history with audio, only text.
It’s surprising that no one has mentioned how this extreme pricing makes the service essentially unusable at this price point. The benefits of low latency and other features are negated by the cost. A more cost-effective approach would be to use a cheaper but very fast LLM like GPT-4o-mini, or a model served by a fast-inference provider like Groq, combined with traditional but low-latency text-to-speech (TTS) services.
If implemented correctly, a pipeline of Speech-to-Text (STT) > LLM > Text Output > TTS Output would be many times cheaper than this new WebSocket audio stream API.
While the technology is impressive and I appreciate them providing access, for any profit-oriented company without millions in venture capital funding to burn on customer acquisition (as a loss leader), this service is practically unusable.
If you look at their example, simple prompts consume as little as 50 tokens: 20 input tokens and 30 output tokens. It’s a voice interface so you generally want short responses. Let’s say your average output is 100 tokens in length. That’s 10,000 responses per 1 million output tokens, or about $0.02 per response at $200 per 1M output tokens.
You’ll need to be smarter about how you manage prompts. What we don’t know yet is how much tool calls cost (they didn’t specifically say) but you’ll probably want to defer a lot of reasoning to happen within tools.
I’m with you. IMHO, OpenAI has been reasonably priced until now. But at 24 (or even 12) cents per minute, I cannot find a realistic use case where this can be used at scale.
At those prices one could hire students to read the responses out loud.
I’m not sure the prices are going to go down substantially / rapidly enough to take the risk of integrating anytime soon.
Keep in mind that a year ago GPT models were more expensive and much less capable. Now the pricing has dropped dramatically and the ability has shot through the roof.
Pricing is not something that should be holding you back. Even at the current pricing there are a lot of use-cases.
One thing I’ll add to the discussion is that I started looking into whether I could use Eleven Labs to implement a cheaper version of Realtime, but when I looked into their pricing they’re at $0.30 per minute. Realtime is actually cheaper than them.
I think OpenAI knows exactly who they’re competing against and they’re pricing things relative to their competitors’ pricing.
Agreed. I was looking forward to adding this to a solution, and technically it works great. I implemented it, but I can’t charge users thousands of dollars a month for a service that would normally have a max subscription of $15. I used it for a minute and it cost nearly 5 bucks. Love the tech but sadly unusable for my solution.
Yeah… If that’s true could you take a look at your activity tab in the dashboard under usage and tell us your input and output tokens for the Realtime API? I’m super curious as well…
I’ve been lightly banging on it for a bit now and I’m only at $7.
That was 1 minute, and yes, it was a conversation that was natural for my solution in a one-minute interaction… Maybe useful for other use cases, but in my case not so much. There are other great things from the latest DevDay, and I love the AI wizards for functions, structured outputs, and instructions. So all in all I’m a happy camper and will use my existing TTS functions.
I’ve had a similar interaction with the Realtime API, maybe not $5 for 1 minute, but $5 for about three minutes of conversation.
We exchanged about 17 calls back and forth. I interrupted the model a few times when I thought it was going astray from our conversation. It was a bit of a rapid back and forth (as is the whole point of this API). The cost mainly arose from the consumed input tokens. It almost seems like the API prepends the entire conversation to every new message, so once you have a bit of history built up, a quick back and forth of several messages causes a huge growth in tokens consumed.
Anyways, my reaction was the same. There is no way this is usable in a production app today with the pricing being what it is. And that’s without using some sort of RAG content to go with the conversation, as was my intention. The latency and tonality improvements compared to a classic voice pipeline are not worth it yet, IMHO.
Edit: for clarity, I was using the OpenAI built test project
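To illustrate why re-sending history would blow up input tokens, here is a rough sketch. It assumes, as suspected above but not confirmed, that every exchange is billed for the whole conversation so far. The 200 tokens per exchange is a made-up number for illustration, the 17 exchanges come from the post above, and the $100 per 1M input rate is the one quoted at the top of the thread.

```python
INPUT_PRICE_PER_1M = 100.0  # USD per 1M audio input tokens, as quoted in this thread


def resent_history_input_tokens(exchanges: int, tokens_per_exchange: int) -> int:
    """Input tokens billed if exchange k is charged for all k exchanges so far.

    This models the suspected behavior only; it is not a documented billing rule.
    """
    return sum(k * tokens_per_exchange for k in range(1, exchanges + 1))


def flat_input_tokens(exchanges: int, tokens_per_exchange: int) -> int:
    """Input tokens billed if each exchange is charged only for its own audio."""
    return exchanges * tokens_per_exchange


def cost_usd(tokens: int) -> float:
    """Cost of the given number of input tokens at the quoted input rate."""
    return tokens * INPUT_PRICE_PER_1M / 1_000_000


EXCHANGES = 17             # from the post above
TOKENS_PER_EXCHANGE = 200  # hypothetical, for illustration only

resent = resent_history_input_tokens(EXCHANGES, TOKENS_PER_EXCHANGE)
flat = flat_input_tokens(EXCHANGES, TOKENS_PER_EXCHANGE)

print(f"history re-sent each turn: {resent} tokens, ${cost_usd(resent):.2f}")
print(f"no history re-sent:        {flat} tokens, ${cost_usd(flat):.2f}")
print(f"ratio: {resent / flat:.1f}x")
```

Under that assumption the billed input grows with the square of the number of exchanges, so a rapid back and forth gets expensive much faster than the raw audio length would suggest.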
My solution requires quite a lot of data, and that is probably the reason. Data is passed as text because the Realtime API does not have an assistant vector database or similar. So yes, this is not a solution that can work with standard knowledge from 2023, and there are several function calls to fetch relevant data. I’m just pointing out that for my solution, which uses unique data, it is a no-go as of now. We can only charge the end user $15 a month. I’m not talking about other use cases.
That’s the cost tab… Can I see the activity tab? If you hover over the Tokens bar you’ll get a breakdown of input vs output tokens. That’s what I’m curious about.