With the new Realtime API, you can now create seamless speech-to-speech experiences—think ChatGPT’s Advanced Voice, but for your own app.
Until now, building voice into apps required stitching together multiple models, which added latency and usually flattened emotion and texture by using transcripts as intermediaries between models.
Available in beta for developers on paid tiers. Check out our docs or use the sample library to get started.
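For reference, a session is a single WebSocket connection rather than the usual REST calls. Here is a minimal Python sketch, assuming the beta endpoint, the `OpenAI-Beta: realtime=v1` header, and the `gpt-4o-realtime-preview` model name from the docs (event shapes may change while the API is in beta):

```python
# Minimal Realtime API session sketch (assumptions: beta wss endpoint,
# gpt-4o-realtime-preview model, response.create / response.done event names).
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer websockets releases call this parameter `additional_headers`.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Ask the model for a spoken reply plus a text transcript.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Say hello to the user.",
            },
        }))
        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # server streams audio deltas and transcripts
            if event["type"] == "response.done":
                break

asyncio.run(main())
```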
I second this. Quite confused and disappointed. It doesn’t sound or feel like I would expect an audio-to-audio multimodal GPT-4o to. It comes across more like an extremely stiff, previous-gen TTS, like the first version of the OpenAI TTS models. It definitely doesn’t seem to have any of the abilities we see in Advanced Voice Mode. It can’t change its accent, assume different tones, sing, or really show any benefit that would justify paying $0.24 per minute of output when it sounds and behaves exactly like a stiff TTS. The latency is great, of course, but 300 ms latency is achievable with existing methods that sound a lot more dynamic and human than what I am hearing here.
I am a developer advocate for Zoom and I’m looking to get started with the Realtime API. However, despite having a paid account, I’m seeing a message that I don’t have access to it.
I see the issue: I needed to add funds to my developer account. While I have a paid account for ChatGPT, the developer portal requires separate funding. Once I added money to my developer account, I gained access to the Realtime API.
Audio is encoded and tokenized, but the tokenizer is not disclosed, so there is no way to estimate input usage except by looking at the daily bill afterward. The anecdotal costs reported here are an order of magnitude greater than the estimate on the pricing page, beyond the price premium already placed on what is ostensibly the same cost-to-process.
Also not fully described is how continuing to send audio into a growing context is managed, or at which points billed inference triggers another calculation of “input tokens”.
If OpenAI won’t provide a client-side encoder, it could at least offer a non-inference endpoint that returns a total token count when sent an audio file and context.
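In the meantime, a rough client-side estimate is about the best one can do. A sketch under stated assumptions: the roughly $0.06/min audio input and $0.24/min audio output figures from the pricing page, plus a fudge factor for the growing context being re-billed as input on each response (this is not an official counting method):

```python
# Unofficial back-of-envelope cost estimate for a Realtime conversation.
# Assumed rates (from the pricing page at launch): ~$0.06/min audio in,
# ~$0.24/min audio out. Real bills can be higher because the accumulated
# context is re-counted as input tokens on every model response.
INPUT_USD_PER_MIN = 0.06
OUTPUT_USD_PER_MIN = 0.24

def estimate_cost(user_minutes: float,
                  assistant_minutes: float,
                  context_refeed_factor: float = 1.0) -> float:
    """context_refeed_factor > 1 approximates re-billed context."""
    input_cost = user_minutes * INPUT_USD_PER_MIN * context_refeed_factor
    output_cost = assistant_minutes * OUTPUT_USD_PER_MIN
    return input_cost + output_cost

# Example: a 10-minute call, speech split evenly, context re-billed ~3x overall.
print(f"${estimate_cost(5, 5, context_refeed_factor=3):.2f}")  # -> $2.10
```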
Why does the model on the API (both OpenAI and Azure) not match the ChatGPT models?
The ChatGPT models seem better: they sound better and can actually handle multiple languages such as Ukrainian, whereas on the API it sounds very weird and has a strong American accent.
Although the amuch and breeze voices work better, they’re still not as good as ChatGPT’s.
Yeah, that’s the question. Looking at the demos from Dev Day, I think OpenAI owes us an explanation, since what we got is completely different from what was promised. And at a premium.
The American accent is hit or miss for me in German. Sometimes it is there, sometimes not. The voice is always extremely clunky in German, though.
Yes, the quality of the audio is definitely different from ChatGPT’s. I imagine it’s just an issue of available resources, or maybe the playground version uses the most compressed audio format? (This option can only be changed via the API.)
EDIT: Nope, the playground uses pcm16.
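For anyone comparing formats themselves, the output format is set per session. A minimal sketch, assuming the documented `session.update` event and the `pcm16` / `g711_ulaw` / `g711_alaw` values (check the current docs before relying on these names):

```python
import json

# Build the session.update event that requests a given output audio format.
# Assumed format names: "pcm16" (default), "g711_ulaw", "g711_alaw".
def output_format_event(fmt: str = "pcm16") -> str:
    return json.dumps({
        "type": "session.update",
        "session": {"output_audio_format": fmt},
    })

# Send it over the already-open Realtime WebSocket, e.g.:
#   await ws.send(output_format_event("g711_ulaw"))
print(output_format_event())
```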
Keep in mind it’s a beta product! The important thing is the core of the service.
I mean, really, I don’t think we all want to hear only the same 3 voices across every company. Getting unique voices that match the clarity of ChatGPT MUST be the next step for the Realtime API.
Aside from the great speed, this is currently very much not worth the price. We want to use the API to provide this to our own customers. 1,000 customers talking for 10 minutes at $2 each is $2,000, which is rough. At the same time, it lacks the emotions, laughter, human reactions, and singing that made our jaws drop at the demo a few months ago. And finally, the Whisper model does not understand me correctly in Hungarian. Maybe it works well in English, but I could not offer this to my Hungarian customers if it keeps misunderstanding what I am saying. Or maybe the price of the speed is that the model seems dumber, judging by its answers.
In summary, I will consider using it once emotions are introduced, but currently it is too expensive for this feature set.