A bunch of big updates for the Realtime API today. We're announcing support for WebRTC, meaning you can add speech-to-speech experiences with just a handful of lines of code (there's a minimal sketch below). (A bunch of you asked, after the Rudolph toy demo in the livestream, for the embedded SDK; Sean has published it to GitHub here.) We've also released two new snapshots:
gpt-4o-realtime-preview-2024-12-17, which has improved voice quality, more reliable input, and 60%+ cheaper audio.
gpt-4o-mini-realtime-preview-2024-12-17, our smaller, more cost-efficient model, priced at 1/10th of the previous audio prices. The voices sound just as good as 4o!
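For anyone who wants a feel for the WebRTC path before reading the docs, here is a minimal browser-side sketch. It assumes your own backend has already minted an ephemeral client key (so your real API key never reaches the browser) and handed it to the page as `ephemeralKey`; the model name and the event handling are placeholders, so check the current reference for the exact endpoint and event details.

```ts
// Minimal browser-side sketch of a Realtime API connection over WebRTC.
// `ephemeralKey` is assumed to come from your own server, which mints it
// via the sessions REST endpoint so your real API key never reaches the client.
async function connectRealtime(ephemeralKey: string) {
  const pc = new RTCPeerConnection();

  // Play the model's audio as soon as the remote track arrives.
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

  // Stream the user's microphone to the model.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getTracks()[0], mic);

  // Data channel for JSON events (transcripts, function calls, etc.).
  const events = pc.createDataChannel("oai-events");
  events.onmessage = (e) => console.log(JSON.parse(e.data));

  // Standard SDP offer/answer exchange against the Realtime endpoint.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
    {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
    }
  );
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return { pc, events };
}
```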
Been waiting for this! Unfortunately it came too late; I’ve already put together my own live chat solution. Maybe this will be a worthwhile upgrade next year anyway. Cost analysis will be a huge factor in that decision - you can do a lot with 4o-mini and a little preprocessing. Hopefully you guys let us cache function schema tokens separately in the future. Would make it much harder to say no to this.
The real-time voice API is amazing! The voices finally sound interesting instead of robotic and flat, like they do in the app.
Can someone explain the pricing to me, please?
Audio Pricing:
$100.00 per 1M input tokens
$20.00 per 1M cached* input tokens
$200.00 per 1M output tokens
What does this mean? Are they charging based on the tokens generated rather than the length of the conversation?
Also, what is a token in an audio conversation?
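For anyone else puzzling over this: billing is per token, not per minute. Audio in and out is converted to tokens at a roughly fixed rate per second, so a longer conversation simply means more audio tokens, and the prices above apply to those counts. Here's a back-of-envelope estimate, with the tokens-per-second figures treated as assumptions for illustration (check the current docs for the real conversion):

```ts
// Rough cost estimate for an audio conversation, using the per-1M-token
// prices quoted above. The tokens-per-second rates are assumptions for
// illustration only; the actual conversion is defined by the model's
// audio tokenizer.
const PRICE_PER_M = { input: 100, cachedInput: 20, output: 200 }; // USD per 1M tokens
const TOKENS_PER_SEC_IN = 10;  // assumption
const TOKENS_PER_SEC_OUT = 10; // assumption

function estimateAudioCost(secondsUserSpoke: number, secondsModelSpoke: number): number {
  const inputTokens = secondsUserSpoke * TOKENS_PER_SEC_IN;
  const outputTokens = secondsModelSpoke * TOKENS_PER_SEC_OUT;
  return (
    (inputTokens / 1_000_000) * PRICE_PER_M.input +
    (outputTokens / 1_000_000) * PRICE_PER_M.output
  );
}

// e.g. a 5-minute call where each side speaks ~2.5 minutes:
console.log(estimateAudioCost(150, 150).toFixed(2)); // ~$0.45 under these assumptions
```

Note this sketch ignores that each model turn also re-reads the accumulated conversation as input context (much of it at the cheaper cached rate), so real costs grow with conversation length.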
One thing the new model still struggles massively with is name recognition in speech. I'm not going to guess at why (there are some obvious candidates, and this has been a long-standing issue with speech-to-text models), but it's an important weakness for the many applications that need to capture names reliably.
If your voice AI is programmed to record callers' names, a fairly common use case I would think, the name it actually thinks it hears can be very unpredictable. By contrast, if it's recording their phone number, it's pretty accurate about what it hears.
Interested in @lylevida's comment, because the transcripts with the previous model were painfully bad. I'll have to pay more attention to what's getting written down now; it would be amazing if we'd been upgraded to a newer version of Whisper, but I would've expected to see that in the release notes. Definitely notable.
Yeah, the transcription is so bad I've considered turning OpenAI's off and passing the audio in parallel to Deepgram or Google. So I'm really hoping it has been improved (or at least there are plans to do so).
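For what it's worth, the parallel approach isn't much code. Here's a rough Node sketch of fanning the same PCM16 audio out to the Realtime API and to a second streaming-STT provider; `SECONDARY_STT_URL`, `SECONDARY_STT_KEY`, the auth scheme, and the secondary provider's message format are all placeholders, so treat this as the shape rather than a drop-in.

```ts
import WebSocket from "ws";

// Sketch: send the same PCM16 microphone audio to the Realtime API and to a
// separate streaming-STT provider, keeping the model's voice replies but
// taking the user-side transcript from the second service.
const SECONDARY_STT_URL = process.env.SECONDARY_STT_URL!; // placeholder
const SECONDARY_STT_KEY = process.env.SECONDARY_STT_KEY!; // placeholder

const realtime = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

const stt = new WebSocket(SECONDARY_STT_URL, {
  headers: { Authorization: `Token ${SECONDARY_STT_KEY}` }, // auth scheme varies by provider
});

// Call this for every chunk of 16-bit PCM captured from the microphone.
function onMicChunk(pcm16: Buffer) {
  // The Realtime API takes base64 audio in input_audio_buffer.append events.
  realtime.send(
    JSON.stringify({ type: "input_audio_buffer.append", audio: pcm16.toString("base64") })
  );
  // Most streaming STT services accept raw binary frames; check your provider's docs.
  stt.send(pcm16);
}

stt.on("message", (msg) => {
  // Use this transcript instead of (or to cross-check) the one the Realtime API
  // emits in conversation.item.input_audio_transcription.completed events.
  console.log("secondary transcript:", msg.toString());
});
```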
User transcription quality has unfortunately not improved. It's a top priority for me right now, so we'll have updates there.
Name input is another area we're working on. If you have any Session_IDs from times when it's particularly bad that you're comfortable sharing, please DM them to me and I can make sure we eval against them as we make model improvements.
It's got a bunch more months of updates, so it doesn't surprise me that it's picked up improvements in function calling. But honestly, that also wasn't one of the main priorities for the speech team with this model.
Thank you for the reply! After another day of testing, it does seem a lot better. We had implemented a GPT-4o-mini-based function-calling "cross check": after the realtime agent responded without a function call, we would cross-check to make sure one wasn't actually needed. This doesn't seem to be necessary any longer, so that's a big win.
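Roughly the shape of that cross-check, for anyone curious (a sketch, not our exact code; the `lookupOrder` tool is a hypothetical stand-in for whatever tools your realtime session exposes):

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The same tool definitions the realtime session exposes.
// `lookupOrder` is a hypothetical placeholder here.
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "lookupOrder",
      description: "Look up an order by its ID",
      parameters: {
        type: "object",
        properties: { orderId: { type: "string" } },
        required: ["orderId"],
      },
    },
  },
];

// After the realtime agent replies *without* calling a function, ask
// GPT-4o-mini whether one of the tools should have been called.
async function crossCheck(transcriptSoFar: string) {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You review a voice-agent conversation. Call a tool only if the agent should have called it but did not; otherwise reply with 'ok'.",
      },
      { role: "user", content: transcriptSoFar },
    ],
    tools,
  });
  // A non-empty tool_calls array means the realtime agent likely missed a call.
  return res.choices[0].message.tool_calls;
}
```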