Realtime voice API is much, much worse than advanced voice mode in the app

I’m trying to make an app to help me practice learning another language.
Advanced voice mode does what I need, but it lacks custom functionality for a specialized curriculum, it interrupts too often, and it doesn’t let you use the push-to-talk button like standard voice mode did.

After building what I wanted in Python using the realtime model in the API, I realized the API version is basically useless: it cannot control the speed of its voice and seems vastly less intelligent than the app version. It also has much less expression and doesn’t seem to pick up on intonation and cues as well, or at all.

It’s tremendously different from the app version and unfortunately kills my use case.
Is it supposed to be like this?
Did they say somewhere that the API version was a gimped version?

Hi @qualityoutdoorgear20

Could you elaborate on this a bit?

The API is a dev tool, so devs have to write their own code for push-to-talk. How the model decides when to start speaking based on the voice input can be influenced by the turn_detection parameter.

The voices Ash, Ballad, Coral, Sage, and Verse were introduced for their enhanced vocal intonation capabilities.

For example, if you want the user to speak long sentences with many pauses, you can update the session and set the create_response property on turn_detection to false. This prevents the model from automatically generating a response when a VAD stop event occurs. Then, once the user has finished speaking and indicated that manually, you can send a response.create event to make the model start generating a response.
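To make that concrete, here is a rough Python sketch of the two events involved. It only builds the JSON payloads; the `PushToTalk` class and its `send` callable are hypothetical names I made up for illustration, and `send` is assumed to be whatever function writes a text message to your already-open Realtime WebSocket. The "server_vad" type is also an assumption about which VAD mode you are using, so check the event shapes against the current API reference.

```python
import json


def disable_auto_response_event() -> dict:
    """Build a session.update event that keeps VAD on but stops automatic replies."""
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",  # assumption: server-side VAD is in use
                "create_response": False,  # no automatic response on VAD stop
            }
        },
    }


def manual_response_event() -> dict:
    """Build the event that asks the model to start generating a response."""
    return {"type": "response.create"}


class PushToTalk:
    """Hypothetical helper: `send` is any callable that takes a JSON string
    and sends it over your open Realtime connection."""

    def __init__(self, send):
        self._send = send

    def start_session(self) -> None:
        # Send once after connecting, before the user starts speaking.
        self._send(json.dumps(disable_auto_response_event()))

    def button_released(self) -> None:
        # Call when the user signals they are done (e.g. releases the button).
        self._send(json.dumps(manual_response_event()))
```

In an app you would call `start_session()` right after the connection opens and `button_released()` from the push-to-talk button handler, so pauses in the middle of a sentence never trigger a reply on their own.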

I was discussing the difference between the advanced voice mode in the app and the API version. I am aware of everything you are saying. My point is that the API version is capable of only a small fraction of the things the app version is capable of, and not at the same level, while also being much less natural, less intelligent, and much less socially and intonationally aware. It’s a very watered-down product, which is surprising since it’s meant for professional development. Maybe they don’t want people making scam calls with it? In any case, it’s not very usable for language learning, unlike the app version. But the app version can’t be customized. Such a shame.

The new voices do work better than the old ones, but their quality is still much lower than the app version’s. Their speech naturalness, their ability to recognize which language I’m speaking, the responses they give, their ability to control their speaking speed, etc. are not very good.
