I’ve used the Realtime API and tweaked its parameters quite a bit, but I still have doubts about whether it’s a finished, fully usable product.
First of all, I feel that the instructions can’t be too long (750 characters seems to be the maximum); beyond that the model gets lost and confused.
There’s always a voice cut-off. I think the model is listening to itself speak and mistakenly thinks it’s the user. The result is an endless loop of interruptions.
Also, the speech transcription is not very good, and sometimes it even detects French or Korean when I’m speaking in English. I don’t see any setting to restrict detection to a specific language (though the question has already been asked on this forum).
Have you had similar problems, and if so, have you found a way to solve them?
Do you have any recommendations?
For the voice cutting off, this seems to be a common issue. Even the official advanced voice mode does this.
You should be able to inspect the logs to determine whether it’s cutting off the voice as a result of hearing itself. There are also methods available to prevent this from happening.
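One such method is tightening the server-side voice activity detection so quiet bleed-through (such as the model’s own output echoing back through the microphone) isn’t treated as user speech. A minimal sketch, assuming the Realtime API’s `session.update` event with `server_vad` turn detection; the specific values are illustrative starting points, not recommendations:

```python
import json

# Sketch: a session.update event that makes server-side VAD less trigger-happy,
# so echo of the model's own voice is less likely to count as an interruption.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.7,           # higher = ignore quieter audio (e.g. echo)
            "prefix_padding_ms": 300,   # audio retained before detected speech
            "silence_duration_ms": 700, # wait longer before ending the user's turn
        }
    },
}

# This payload would be sent over the Realtime WebSocket, e.g. ws.send(payload).
payload = json.dumps(session_update)
```

Raising `threshold` and `silence_duration_ms` trades a little responsiveness for fewer false interruptions; proper acoustic echo cancellation on the client is still the more robust fix.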
In terms of transcription issues: yes, this is a common issue with Whisper, which attempts to classify the audio’s language before performing transcription/translation. You may be better off handling the transcription yourself, especially since I believe you are charged for the built-in one either way.
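If you do handle transcription yourself, you can pin the language up front rather than relying on Whisper’s auto-detection (the step that misfires as French or Korean). A sketch, assuming the OpenAI SDK’s transcription parameters; `build_transcribe_kwargs` is a hypothetical helper and the actual client call is omitted:

```python
def build_transcribe_kwargs(language: str = "en") -> dict:
    """Build request parameters for a Whisper transcription call with the
    input language pinned, bypassing automatic language detection."""
    return {
        "model": "whisper-1",
        "language": language,        # ISO-639-1 hint, e.g. "en"
        "response_format": "text",
    }

# Usage (illustrative): client.audio.transcriptions.create(
#     file=audio_file, **build_transcribe_kwargs("en"))
```

Supplying the `language` hint is also supposed to improve accuracy and latency, since the model skips the detection pass.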
For instructions: ideally, instructions for the Realtime API model should be kept as lightweight as possible. This is inherent in any pioneering model: it’s all about micromanaging until abundance becomes available.
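One lightweight way to enforce that is a guard on instruction length before opening the session. The 750-character ceiling below is the limit observed earlier in this thread, not a documented one; treat it as a tunable budget, and `check_instructions` as a hypothetical helper:

```python
INSTRUCTION_BUDGET = 750  # observed ceiling from this thread, not documented

def check_instructions(instructions: str, budget: int = INSTRUCTION_BUDGET) -> str:
    """Fail fast if the system instructions exceed the working budget,
    instead of letting the model silently get lost mid-session."""
    if len(instructions) > budget:
        raise ValueError(
            f"instructions are {len(instructions)} chars; "
            f"keep them under {budget}"
        )
    return instructions
```

Failing fast at startup is easier to debug than a model that quietly ignores the tail of an overlong prompt.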
It is in a preview stage, not production. It’s great for a PoC, but don’t let anyone tell you it’s fully functional in production.