Introducing the Realtime API

With the new Realtime API, you can now create seamless speech-to-speech experiences—think ChatGPT’s Advanced Voice, but for your own app.

Until now, building voice into apps required stitching together multiple models, which added latency and usually flattened emotion and texture by using transcripts as intermediaries between models.

Available in beta for developers on paid tiers. Check out our docs or use the sample library to get started.

Or just give it a whirl in the Playground.
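
If you’d rather see the wire protocol directly, here is a minimal Python sketch of opening a session over WebSocket and requesting one spoken reply. The model name, endpoint, and event shapes follow the beta docs at launch, so verify them against the current reference:

```python
# Minimal sketch: connect to the Realtime API over WebSocket and request a
# single spoken reply. Endpoint, headers, and event names follow the beta
# docs at launch; double-check them against the current API reference.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # On websockets < 14 the keyword argument is extra_headers instead.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Ask the model for a response with both text and audio output.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["text", "audio"],
                "instructions": "Say hello in one short sentence.",
            },
        }))
        # Stream server events until the response finishes; audio arrives as
        # base64 chunks in response.audio.delta events.
        async for message in ws:
            event = json.loads(message)
            print(event["type"])
            if event["type"] in ("response.done", "error"):
                break

asyncio.run(main())
```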

22 Likes

Wooo!! The speed on this is jaw-dropping. Seeing it respond before the transcript is even available is… insane.

6 Likes

Awesome! Time to build my MVP for a customer! I pitched a proof of concept to them the other day, exciting stuff.

1 Like

@jeffsharris can you please clarify why it doesn’t feel like the demos or Advanced Voice mode?

It feels just like STT, and the voices sound just like the old ones most of the time.

Are there different versions deployed for different regions (e.g. the EU)?

I can’t get it to sense accent, tone, speed, volume, ambient sounds, etc., and it claims to only process text. I am highly confused.

6 Likes

Looks like it’s already provisionable on Azure AI Studio too!

Very cool stuff! Thanks!

3 Likes

I second this. Quite confused and disappointed. It doesn’t sound or feel like what I would expect from an audio-to-audio multimodal GPT-4o. It comes across more like an extremely stiff, previous-gen TTS, like the first version of the OpenAI TTS models. It definitely doesn’t seem to have any of the abilities we see in Advanced Voice mode. It can’t change its accent, assume different tones, sing, or, well, really show any reason why anyone would want to pay $0.24 per minute of output when it sounds and behaves exactly like a stiff TTS. The latency is great, of course, but 300 ms latency is achievable with existing methods that sound a lot more dynamic and human than what I am hearing here.

6 Likes

I also got charged $7 for a few tests running the OpenAI sample app repo. Anyone else experience the same?
The emotions aren’t as good, either.

3 Likes

Greetings, @jeffsharris,

I am a developer advocate for Zoom and I’m looking to get started with the Realtime API. However, despite having a paid account, I’m seeing a message that I don’t have access to it.

Am I missing something?

1 Like

@donte.zoomie

Be sure to select the organisation if that’s the one with a paid account:

When selecting “Personal”:

When selecting our Company:

3 Likes

@traditionals15
Personal is the only organization I have set up. Here is what the billing page looks like:

Is there a specific account or additional plan that needs to be enabled for the developer portal?

1 Like

What does it look like when you click this?

1 Like

I see the issue: I needed to add funds to my developer account. While I have a paid account for ChatGPT, the developer platform requires separate funding. Once I added money to my developer account, I gained access to the Realtime API.

Thanks for your help, @traditionals15 !

1 Like

That is somewhat expected.

Audio is encoded and tokenized, but the tokenizer is not disclosed, so there is no way to estimate the input cost except by looking at the daily bill afterwards. The anecdotes here are an order of magnitude greater than the estimate provided on the pricing page, on top of the price premium placed on what is ostensibly the same cost-to-process.

Also not fully described is how continuing to send audio into a growing context is managed, and the points at which billed inference triggers another calculation of “input tokens”.

If OpenAI won’t provide a client-side encoder, it could at least offer a non-inference endpoint that returns a token total when sent an audio file and context.
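
Until something like that exists, the closest thing to an estimate is the pricing page’s per-minute figures. A rough sketch, assuming the launch numbers of roughly $0.06 per minute of audio in and $0.24 per minute of audio out; it ignores text tokens and any re-billing of the growing audio context on each turn, which, if each new response does reprocess the accumulated conversation audio as input, would explain why real bills run higher:

```python
# Back-of-the-envelope session cost from the pricing page's per-minute figures.
# The rates below are the approximate launch numbers and are assumptions here;
# check the current pricing page. Text tokens and any re-billing of the growing
# audio context on each turn are NOT included, so treat this as a lower bound.
AUDIO_IN_PER_MIN = 0.06   # USD per minute of audio sent to the model
AUDIO_OUT_PER_MIN = 0.24  # USD per minute of audio generated by the model

def estimate_cost(minutes_in: float, minutes_out: float) -> float:
    """Naive audio-only cost estimate for one session, in USD."""
    return minutes_in * AUDIO_IN_PER_MIN + minutes_out * AUDIO_OUT_PER_MIN

# A 10-minute call where the model speaks for about 4 minutes:
print(f"${estimate_cost(10, 4):.2f}")  # ~$1.56 before text tokens and re-billing
```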

2 Likes

Why does the model on the API (both OpenAI and Azure) mismatch the ChatGPT models?

The ChatGPT models seem better: they sound better and can actually handle multilingual speech such as Ukrainian, while on the API it sounds very weird and has a strong American accent.

Although the amuch and breeze voices work better, they’re still not as good as ChatGPT’s.

2 Likes

Yeah, that’s the question. And looking at the demos from Dev Day, I think OpenAI owes us an explanation, as what we got is completely different from what was promised. And at a premium.

The American accent is hit or miss for me in German: sometimes it is there, sometimes not. The voice is always extremely clunky in German, though.

Yes, the quality of the audio is definitely different from ChatGPT’s. I imagine it’s just a question of available resources, and maybe the Playground version uses the most compressed audio format? (This option can only be changed via the API.)

EDIT: Nope, the Playground uses pcm16.
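
For anyone who wants to experiment, the format is set per session with a session.update event. A minimal sketch of that event, with field names as I understand them from the beta API reference (verify against the current docs):

```python
import json

# Sketch of a session.update event that picks the audio formats and voice.
# Field names and allowed values follow the beta API reference as I read it;
# verify against the current docs. Send this over an open Realtime WebSocket.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",  # alternatives: "g711_ulaw", "g711_alaw"
    },
}

# e.g. await ws.send(json.dumps(session_update)) on an existing connection
print(json.dumps(session_update, indent=2))
```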

Keep in mind it’s a beta product! The important thing is the core of the service.

I mean, really, I don’t think we all want to hear the same 3 voices across every company :rofl: Getting unique voices that match the clarity of ChatGPT MUST be the next step for the Realtime API.

1 Like

Aside from the great speed, this is currently very much not worth the price. We want to use the API to serve our own customers, and 1,000 customers talking for 10 minutes at roughly $2 each comes to $2,000, which is rough. At the same time, it lacks the emotions, laughter, human reactions, and singing that made our jaws drop at the demo a few months ago. And finally, the Whisper model does not understand me correctly in Hungarian. Maybe it works well in English, but I could not offer this to my Hungarian customers if it always misunderstood what I am saying. Or maybe the price of the speed is that the model comes across as dumber in its answers.
In summary, I will consider using it once the emotions are introduced, but it is currently too expensive for this feature set.

3 Likes

Even $2 is too low, I think; I was billed over $4 for just a few minutes. The more interruptions (accidental or not), the worse it gets.

2 Likes

My experience is similar. I ran a few tests and got a $7.54 bill.

It’s possible that the OpenAI metering is off.

I don’t seem to have access to all the log files for each session, so I cannot validate it.

I do not think that my entire usage time was more than 20 minutes.

I cannot afford to develop at these rates.

IMHO, this will impede adoption and limit development to only well-funded shops.