Realtime API updates — WebRTC, cheaper prices, 4o-mini, and more

A bunch of big updates for the Realtime API today. We’re announcing support for WebRTC, meaning you can add speech-to-speech experiences with just a handful of lines of code. (A bunch of you asked after the Rudolph toy demo in the livestream for the embedded SDK — Sean has published it to GitHub here.) We’ve also released two new snapshots:

  • gpt-4o-realtime-preview-2024-12-17, which has improved voice quality and more reliable input, with audio that is over 60% cheaper.
  • gpt-4o-mini-realtime-preview-2024-12-17, our smaller, more cost-efficient model, priced at 1/10th the previous rates. The voices sound just as good as 4o!
  • Both models are also available in Chat Completions.

We’ve also been adding more features to the Realtime beta that give you more control over responses.

You can read more on how to get started in our docs.

Or poke around for yourself in the playground!

26 Likes

I absolutely love gpt-4o-audio-preview. The quality is amazing! Can’t wait to implement it and deploy it! Thank you so much Jeff and API team!

5 Likes

I have an ESP32 LyraT laying around. I wonder if it will work?

But this “AI speaker” hooked to the API has been a desire of mine for at least a year!

Good to see some others on the same wavelength. :rocket:

2 Likes

Been waiting for this! Unfortunately it came too late; I’ve already put together my own live chat solution. Maybe this will be a worthwhile upgrade next year anyway. Cost analysis will be a huge factor in that decision - you can do a lot with 4o-mini and a little preprocessing. Hopefully you guys let us cache function schema tokens separately in the future. Would make it much harder to say no to this.

1 Like

It’s amazing, but I can’t get it to respond with JSON … :confused:

2 Likes

The real-time voice API is amazing! The voices finally sound interesting instead of robotic and flat, like they do in the app.

Can someone explain the pricing to me, please?

Audio Pricing:

  • $100.00 per 1M input tokens
  • $20.00 per 1M cached* input tokens
  • $200.00 per 1M output tokens

What does this mean? Are they charging based on the tokens generated rather than the length of the conversation?
Also, what is a token in an audio conversation?
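My current reading, which may be wrong: they charge per token rather than per minute of conversation, with the audio on each side converted into tokens. A quick sketch of the arithmetic (rates from the list above; the token counts are invented for illustration):

```python
# Audio rates quoted above, in USD per 1M tokens
INPUT = 100.00 / 1_000_000   # fresh audio input tokens
CACHED = 20.00 / 1_000_000   # cached audio input tokens
OUTPUT = 200.00 / 1_000_000  # audio output tokens

def turn_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost of one turn: cached input is billed at the cheaper rate,
    the rest of the input at the full rate, plus output."""
    fresh = input_tokens - cached_tokens
    return fresh * INPUT + cached_tokens * CACHED + output_tokens * OUTPUT

# e.g. 2,000 audio input tokens (1,500 of them cached) and 300 audio output tokens
print(f"${turn_cost(2_000, 1_500, 300):.2f}")  # $0.14
```

If that's right, cost scales with how much audio is sent and generated, not with wall-clock time.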

1 Like

Is it available through Azure OpenAI hosted in Europe?

1 Like

Congratulations on the announcement!

Function calling seems to be a lot better with this release. The previous release was very bad at function calling vs. vanilla GPT-4o.

Can you confirm this is something you targeted?

Transcript quality seems to also be improved. Anything changed there?

One thing the new model still struggles with massively is name recognition in speech. I won’t guess at why (there are some obvious candidates, and this has been a long-standing problem for speech-to-text models), but it’s an important weakness for the many applications that would love to rely on it.

If your voice AI is programmed to record callers’ names (a fairly common use case, I would think), the name it actually thinks it hears can be very unpredictable. By contrast, if it’s recording their phone number, it’s pretty accurate about what it hears.

Interested in @lylevida’s comment, because the transcripts with the previous model were painfully bad :smile: . I’ll have to pay more attention to what’s getting written down now. It would be amazing if we’d been upgraded to a newer version of Whisper, but I would’ve expected to see that in the release notes; definitely notable.

2 Likes

Yeah, the transcription is so bad I’ve considered turning OpenAI’s off and passing audio in parallel to Deepgram or Google. So I’m really hoping it has been improved (or at least that there are plans to do so).

2 Likes

User transcription quality has unfortunately not improved. It’s a top priority for me right now, so we’ll have updates there.

Name input is another area we’re working on. If you have any session IDs from times when it was particularly bad that you’re comfortable sharing, please DM them to me and I can make sure we eval against them as we make model improvements.

9 Likes

Possible to comment on the function-calling performance question?

It’s had a bunch more months of updates, so it doesn’t surprise me that it’s picked up improvements in function calling. But honestly, it wasn’t one of the main priorities for the speech team with this model.

2 Likes

The world has truly entered the Great RTC Era!

Thank you for the reply! After another day of testing it does seem a lot better. We had implemented a GPT-4o-mini-based function-calling “cross check”: after the realtime agent responded without a function call, we would cross-check to make sure one wasn’t actually needed. That no longer seems necessary, so big win.

2 Likes

It provides usage for each conversation turn, like this:

```json
"usage": [
  {
    "total_tokens": 1802,
    "input_tokens": 1477,
    "output_tokens": 325,
    "input_token_details": {
      "text_tokens": 1477,
      "audio_tokens": 0,
      "cached_tokens": 0,
      "cached_tokens_details": {
        "text_tokens": 0,
        "audio_tokens": 0
      }
    },
    "output_token_details": {
      "text_tokens": 60,
      "audio_tokens": 265
    }
  },
  {
    "total_tokens": 2006,
    "input_tokens": 1822,
    "output_tokens": 184,
    "input_token_details": {
      "text_tokens": 1547,
      "audio_tokens": 275,
      "cached_tokens": 1792,
      "cached_tokens_details": {
        "text_tokens": 1536,
        "audio_tokens": 256
      }
    },
    "output_token_details": {
      "text_tokens": 38,
      "audio_tokens": 146
    }
  },
  {
    "total_tokens": 2211,
    "input_tokens": 2027,
    "output_tokens": 184,
    "input_token_details": {
      "text_tokens": 1595,
      "audio_tokens": 432,
      "cached_tokens": 1856,
      "cached_tokens_details": {
        "text_tokens": 1536,
        "audio_tokens": 320
      }
    },
    "output_token_details": {
      "text_tokens": 37,
      "audio_tokens": 147
    }
  },
```

So, with that last post, would this mean that any time the model is executed it uses 2,211 tokens?

Yes; my system prompt is about 1,800 tokens, and the whole conversation so far is resent as input on every turn, which is why input_tokens keeps growing.
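Those usage fields also map onto the audio rates quoted earlier in the thread ($100 / $20 cached / $200 per 1M input / cached-input / output audio tokens). A rough sketch of estimating one turn's audio cost from a usage object; note this ignores text tokens, whose (cheaper) rates weren't quoted here, so it understates the true total:

```python
def audio_cost_usd(usage_turn: dict) -> float:
    """Estimate the audio-only cost of one turn, using the audio rates
    quoted in this thread. Text tokens are billed separately and ignored."""
    inp = usage_turn["input_token_details"]
    out = usage_turn["output_token_details"]
    cached_audio = inp["cached_tokens_details"]["audio_tokens"]
    fresh_audio = inp["audio_tokens"] - cached_audio
    return (fresh_audio * 100.00
            + cached_audio * 20.00
            + out["audio_tokens"] * 200.00) / 1_000_000

# Third turn from the usage dump above: 432 audio input tokens
# (320 of them cached) and 147 audio output tokens.
turn = {
    "input_token_details": {
        "audio_tokens": 432,
        "cached_tokens_details": {"audio_tokens": 320},
    },
    "output_token_details": {"audio_tokens": 147},
}
print(f"${audio_cost_usd(turn):.4f}")  # $0.0470
```

So a single turn like that one costs a few cents of audio, with caching keeping the growing input from dominating.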