Loving the Realtime API for my voice application - it feels like magic. I have a few suggestions I'd love to get people's feedback on.
- Hallucinations. Usually caused by high background noise triggering the VAD when there is no actual speech input. Would appreciate seeing these reduced. (A `session.update` sketch for making the current VAD less sensitive follows the list.)
- More intelligent VAD: the current VAD is based only on speech power. I'd love a VAD that understands when people are pausing mid-sentence, based on intonation or speech content, e.g. "Hmmm", "umm", "well", etc.
- More accurate transcriptions. I'm not talking about the Whisper-generated input transcriptions; currently I ask the Realtime API for its own transcription of what it heard via a tool call (sketched at the end of this post), and I'd like the transcriptions the Realtime API itself understands to be more accurate.
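
For anyone else hitting the background-noise hallucinations, here is a minimal sketch of the partial workaround I mean: making the built-in server VAD less trigger-happy via `session.update`. It assumes a raw WebSocket connection using the `ws` package; the model name and the threshold/silence values are illustrative, so tune them for your own noise floor.

```ts
import WebSocket from "ws";

// Endpoint and model name as documented at the time of writing; adjust as needed.
const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";

const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", () => {
  // Make server VAD less sensitive: a higher energy threshold and a longer
  // silence window before the turn is considered finished, so background
  // noise is less likely to trigger a response when nobody spoke.
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        turn_detection: {
          type: "server_vad",
          threshold: 0.8,           // default is lower; illustrative value
          prefix_padding_ms: 300,
          silence_duration_ms: 800, // illustrative value
        },
      },
    })
  );
});
```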
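
And here is a rough sketch of the tool-call transcription workaround from the last bullet, continuing with the `ws` connection from the sketch above. The `report_transcription` tool name, the instructions text, and the event handling are my own illustration of the idea, not an official pattern.

```ts
ws.on("open", () => {
  // Ask the model to report back what it heard via a (hypothetical) tool.
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        instructions:
          "After each user turn, call report_transcription with the exact words you heard, then answer normally.",
        tools: [
          {
            type: "function",
            name: "report_transcription", // name is my own, not part of the API
            description:
              "Report the verbatim transcription of the user's last utterance.",
            parameters: {
              type: "object",
              properties: { transcript: { type: "string" } },
              required: ["transcript"],
            },
          },
        ],
      },
    })
  );
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // The full tool-call arguments arrive in this event once streaming finishes.
  // With several tools registered you would also match on the call's name/call_id.
  if (event.type === "response.function_call_arguments.done") {
    const { transcript } = JSON.parse(event.arguments);
    console.log("Transcript as understood by the Realtime API:", transcript);
  }
});
```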