Wishlist for Realtime API

Loving the realtime API for my voice application - feels like magic. I have a few suggestions I would love to get people’s feedback on.

  1. Fewer hallucinations. These usually happen when loud background noise triggers the VAD even though there is no speech input. I’d appreciate seeing these reduced.
  2. More intelligent VAD. The current VAD is based only on speech power. I’d love a VAD that understands when people are pausing in the middle of a sentence, based on voice intonation or speech content (e.g. “hmmm”, “umm”, “well”).
  3. More accurate transcriptions. I’m not talking about the Whisper-generated transcriptions. Currently I ask the Realtime API what it heard by way of a tool call, and I would like that transcription, i.e. what the Realtime API itself understood, to be more accurate.
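
To make points 1 and 2 a bit more concrete, here is a rough Python sketch of what I mean. It is only an illustration, not how the Realtime API actually implements its VAD: `energy_vad` shows why a purely power-based detector fires on loud noise, and `silence_timeout_ms` shows the kind of content-aware pause handling I’m wishing for. All names, thresholds, timeouts, and the filler list are made up for the example.

```python
import math

# Hypothetical filler words that suggest the speaker is not finished yet.
FILLERS = {"hmm", "hmmm", "um", "umm", "uh", "well"}


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(frames, threshold=0.05):
    """Flag a frame as 'speech' whenever its RMS exceeds a fixed threshold.

    This looks at power only, so loud background noise is flagged
    exactly like real speech (the problem in point 1).
    """
    return [frame_rms(f) > threshold for f in frames]


def silence_timeout_ms(partial_transcript, base_ms=500, extended_ms=1500):
    """Pick how long to wait in silence before ending the user's turn.

    If the last word heard is a filler, wait longer, because the
    speaker is probably mid-sentence (the wish in point 2).
    """
    cleaned = partial_transcript.lower().replace(",", " ").replace(".", " ")
    words = cleaned.split()
    if words and words[-1] in FILLERS:
        return extended_ms
    return base_ms


if __name__ == "__main__":
    silence = [0.0] * 160
    # Alternating +/-0.1 samples: loud noise with no speech in it.
    noise = [0.1 if i % 2 == 0 else -0.1 for i in range(160)]
    print(energy_vad([silence, noise]))
    print(silence_timeout_ms("I would like, umm"))
    print(silence_timeout_ms("Book a table for two."))
```

A real version would need intonation as well as text, which is exactly why I’d rather have this handled inside the API than bolted on client-side.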