A bunch of big updates for the Realtime API today. We're announcing support for WebRTC, meaning you can add speech-to-speech experiences with just a handful of lines of code (there's a minimal sketch below). (A bunch of you asked, after the Rudolph toy demo in the livestream, for the embedded SDK; Sean has published it to GitHub here.) We've also released two new snapshots:
gpt-4o-realtime-preview-2024-12-17, which has improved voice quality, more reliable input, and 60%+ cheaper audio.
gpt-4o-mini-realtime-preview-2024-12-17, our smaller, more cost-efficient model, priced at 1/10th of the previous audio prices. The voices sound just as good as 4o!
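For anyone who wants a feel for the WebRTC path before reading the docs, here is a minimal browser-side sketch. It assumes your own backend has already minted an ephemeral client key (so your real API key never reaches the browser) and handed it to the page as `ephemeralKey`; the model name and the event handling are placeholders, so check the current reference for the exact endpoint and event details.

```ts
// Minimal browser-side sketch of a Realtime API connection over WebRTC.
// `ephemeralKey` is assumed to come from your own server, which mints it
// via the sessions REST endpoint so your real API key never reaches the client.
async function connectRealtime(ephemeralKey: string) {
  const pc = new RTCPeerConnection();

  // Play the model's audio as soon as the remote track arrives.
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

  // Stream the user's microphone to the model.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getTracks()[0], mic);

  // Data channel for JSON events (transcripts, function calls, etc.).
  const events = pc.createDataChannel("oai-events");
  events.onmessage = (e) => console.log(JSON.parse(e.data));

  // Standard SDP offer/answer exchange against the Realtime endpoint.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
    {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
    }
  );
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return { pc, events };
}
```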
Been waiting for this! Unfortunately it came too late; I’ve already put together my own live chat solution. Maybe this will be a worthwhile upgrade next year anyway. Cost analysis will be a huge factor in that decision - you can do a lot with 4o-mini and a little preprocessing. Hopefully you guys let us cache function schema tokens separately in the future. Would make it much harder to say no to this.
The real-time voice API is amazing! The voices finally sound interesting instead of robotic and flat, like they do in the app.
Can someone explain the pricing to me, please?
Audio Pricing:
$100.00 per 1M input tokens
$20.00 per 1M cached* input tokens
$200.00 per 1M output tokens
What does this mean? Are they charging based on the tokens generated rather than the length of the conversation?
Also, what is a token in an audio conversation?
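For anyone else puzzling over this: billing is per token, not per minute. Audio in and out is converted to tokens at a roughly fixed rate per second, so a longer conversation simply means more audio tokens, and the prices above apply to those counts. Here's a back-of-envelope estimate, with the tokens-per-second figures treated as assumptions for illustration (check the current docs for the real conversion):

```ts
// Rough cost estimate for an audio conversation, using the per-1M-token
// prices quoted above. The tokens-per-second rates are assumptions for
// illustration only; the actual conversion is defined by the model's
// audio tokenizer.
const PRICE_PER_M = { input: 100, cachedInput: 20, output: 200 }; // USD per 1M tokens
const TOKENS_PER_SEC_IN = 10;  // assumption
const TOKENS_PER_SEC_OUT = 10; // assumption

function estimateAudioCost(secondsUserSpoke: number, secondsModelSpoke: number): number {
  const inputTokens = secondsUserSpoke * TOKENS_PER_SEC_IN;
  const outputTokens = secondsModelSpoke * TOKENS_PER_SEC_OUT;
  return (
    (inputTokens / 1_000_000) * PRICE_PER_M.input +
    (outputTokens / 1_000_000) * PRICE_PER_M.output
  );
}

// e.g. a 5-minute call where each side speaks ~2.5 minutes:
console.log(estimateAudioCost(150, 150).toFixed(2)); // ~$0.45 under these assumptions
```

Note this sketch ignores that each model turn also re-reads the accumulated conversation as input context (much of it at the cheaper cached rate), so real costs grow with conversation length.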
One thing the new model still struggles massively with is name recognition in speech. I'm not going to guess at why (there are some obvious candidates, and this has been a long-standing issue with speech-to-text models), but it's an important weakness for the many applications that need to capture names reliably.
If your voice AI is programmed to record callers' names, a fairly common use case I would think, the name it actually thinks it hears can be very unpredictable. By contrast, if it's recording their phone number, it's pretty accurate about what it hears.
Interested in @lylevida's comment, because the transcripts with the previous model were painfully bad. I'll have to pay more attention to what's getting written down now; it would be amazing if we'd been upgraded to a newer version of Whisper, but I would've expected to see that in the release notes. Definitely notable.
Yeah, the transcription is so bad I've considered turning OpenAI's off and passing the audio in parallel to Deepgram or Google. So I'm really hoping it has been improved (or at least there are plans to do so).
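For what it's worth, the parallel approach isn't much code. Here's a rough Node sketch of fanning the same PCM16 audio out to the Realtime API and to a second streaming-STT provider; `SECONDARY_STT_URL`, `SECONDARY_STT_KEY`, the auth scheme, and the secondary provider's message format are all placeholders, so treat this as the shape rather than a drop-in.

```ts
import WebSocket from "ws";

// Sketch: send the same PCM16 microphone audio to the Realtime API and to a
// separate streaming-STT provider, keeping the model's voice replies but
// taking the user-side transcript from the second service.
const SECONDARY_STT_URL = process.env.SECONDARY_STT_URL!; // placeholder
const SECONDARY_STT_KEY = process.env.SECONDARY_STT_KEY!; // placeholder

const realtime = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

const stt = new WebSocket(SECONDARY_STT_URL, {
  headers: { Authorization: `Token ${SECONDARY_STT_KEY}` }, // auth scheme varies by provider
});

// Call this for every chunk of 16-bit PCM captured from the microphone.
function onMicChunk(pcm16: Buffer) {
  // The Realtime API takes base64 audio in input_audio_buffer.append events.
  realtime.send(
    JSON.stringify({ type: "input_audio_buffer.append", audio: pcm16.toString("base64") })
  );
  // Most streaming STT services accept raw binary frames; check your provider's docs.
  stt.send(pcm16);
}

stt.on("message", (msg) => {
  // Use this transcript instead of (or to cross-check) the one the Realtime API
  // emits in conversation.item.input_audio_transcription.completed events.
  console.log("secondary transcript:", msg.toString());
});
```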
User transcription quality has unfortunately not improved. It's a top priority for me right now, so we'll have updates there.
Name input is another area we're working on. If you have any Session_IDs from times when it's particularly bad that you're comfortable sharing, please DM them to me and I can make sure we eval against them as we make model improvements.
It's got a bunch more months of updates, so it doesn't surprise me that it's picked up improvements in function calling. But honestly, that also wasn't one of the main priorities for the speech team with this model.
Thank you for the reply! After another day of testing, it does seem a lot better. We had implemented a GPT-4o-mini-based function-calling "cross check": after the realtime agent responded without a function call, we would cross-check to make sure one wasn't actually needed. This doesn't seem to be necessary any longer, so that's a big win.
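Roughly the shape of that cross-check, for anyone curious (a sketch, not our exact code; the `lookupOrder` tool is a hypothetical stand-in for whatever tools your realtime session exposes):

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The same tool definitions the realtime session exposes.
// `lookupOrder` is a hypothetical placeholder here.
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "lookupOrder",
      description: "Look up an order by its ID",
      parameters: {
        type: "object",
        properties: { orderId: { type: "string" } },
        required: ["orderId"],
      },
    },
  },
];

// After the realtime agent replies *without* calling a function, ask
// GPT-4o-mini whether one of the tools should have been called.
async function crossCheck(transcriptSoFar: string) {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You review a voice-agent conversation. Call a tool only if the agent should have called it but did not; otherwise reply with 'ok'.",
      },
      { role: "user", content: transcriptSoFar },
    ],
    tools,
  });
  // A non-empty tool_calls array means the realtime agent likely missed a call.
  return res.choices[0].message.tool_calls;
}
```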