I’m trying to make an app to help me practice learning another language.
The advanced voice mode does what I need, but it lacks custom functionality for a specialized curriculum, it interrupts too often, and they don’t let you use the push-to-talk button like they did in the standard voice mode.
After building what I wanted in Python using the realtime model in the API, I realized the API version is basically useless for this: it cannot control the speed of its voice and seems vastly less intelligent than the app version. It also has much less expression and doesn’t pick up on intonation and vocal cues as well, or at all.
It’s tremendously different from the app version, which unfortunately kills my use case.
Is it supposed to be like this?
Did they say the api version was a gimped version somewhere?
The API is a dev tool. Hence, devs have to write their own code for push-to-talk. How the model decides to start speaking based on the voice input can be influenced by the turn_detection parameter.
The voices Ash, Ballad, Coral, Sage, and Verse were introduced for their enhanced vocal intonation capabilities.
For example, if you want the user to speak long, pause-heavy sentences, you can update the session with the create_response property on turn_detection set to false. This prevents the model from automatically generating a response when a VAD stop event occurs. Then, once the user has stopped speaking and signaled that manually, you can send a response.create event to make the model start generating a response. A rough sketch is below.
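Something like this, assuming a raw WebSocket connection and the event names from the Realtime API docs. The model name is only an example, and the `websockets` header keyword varies by library version (additional_headers vs. extra_headers), so treat this as a sketch rather than a drop-in solution:

```python
# Minimal push-to-talk sketch over a raw WebSocket to the Realtime API.
# Assumptions: the `websockets` package is installed, OPENAI_API_KEY is set,
# and the model name below is just a placeholder.
import asyncio
import json
import os

import websockets

MODEL = "gpt-4o-realtime-preview"  # example model name
URL = f"wss://api.openai.com/v1/realtime?model={MODEL}"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def push_to_talk():
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Keep server VAD for speech detection, but stop it from auto-replying.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "turn_detection": {"type": "server_vad", "create_response": False},
            },
        }))

        # ... while the push-to-talk key is held, stream microphone audio
        # with input_audio_buffer.append events ...

        # When the key is released, commit the audio and explicitly request
        # a response: this replaces the trigger VAD would have fired on its own.
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        # Print incoming event types so you can see the response streaming back.
        async for message in ws:
            print(json.loads(message).get("type"))

asyncio.run(push_to_talk())
```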
I was discussing the difference between the advanced voice mode in the app and the API version. I am aware of everything you are saying. My point is, the API version is capable of only a small fraction of what the app version can do, at the level it does it, while also being much less natural, less intelligent, and much less socially and intonationally aware. It’s a very watered-down product, which is surprising since it’s meant for professional development. Maybe they don’t want people making scam calls with it? In any case, it’s not very usable for language learning, unlike the app version. But the app version can’t be customized. Such a shame.
The new voices do work better than the old ones, but their quality is still much lower than the app version. The speech naturalness, their ability to recognize which language I’m speaking, the responses they give, their ability to control their speed of speech, etc. are not very good.
Every single one of these posts ends up with the same response: silence. There is no official response from OpenAI anywhere. I can only imagine they simply don’t want people to know about or use their AVM. It’s too realistic, and I assume they think people are going to scam with it. It still would be nice to have an official answer on whether it’s ever coming or not. In the meantime we have had better success with other TTS engines, at the cost of additional delays. Not ideal, but it does sound better than the terrible voices available in the Realtime API.
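For anyone wanting to try that workaround, here is a rough sketch of the text-then-TTS pipeline. The poster doesn’t say which TTS engine they used; this just uses OpenAI’s tts-1 endpoint as a stand-in, and the chat model name is likewise a placeholder, so read it as an illustration of the pattern (and of where the extra latency comes from), not their exact setup:

```python
# Sketch of the text-then-TTS workaround: generate the reply as text, then
# synthesize it in a separate TTS pass. Model names are placeholders; swap in
# whichever chat model and TTS engine you actually use.
from openai import OpenAI

client = OpenAI()

def reply_as_speech(user_text: str, out_path: str = "reply.mp3") -> str:
    # 1) Generate the tutor's reply as plain text.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a patient language tutor."},
            {"role": "user", "content": user_text},
        ],
    )
    text = chat.choices[0].message.content

    # 2) Synthesize the text with a separate TTS call.
    #    The second round trip is where the added delay comes from.
    speech = client.audio.speech.create(
        model="tts-1",   # stand-in for whatever TTS engine you prefer
        voice="alloy",
        input=text,
    )
    with open(out_path, "wb") as f:
        f.write(speech.content)
    return out_path

print(reply_as_speech("How do I politely ask for the bill in Spanish?"))
```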