Loving the Realtime API for my voice application - it feels like magic. I have a few suggestions I'd love to get people's feedback on.
- Hallucinations. Usually caused by high background noise triggering the VAD when there is no actual speech input. Would appreciate seeing these reduced. (A `session.update` sketch for making the current VAD less sensitive follows the list.)
- More intelligent VAD: the current VAD is based only on speech power. I'd love a VAD that understands when people are pausing mid-sentence, based on intonation or speech content, e.g. "Hmmm", "umm", "well", etc.
- More accurate transcriptions. I'm not talking about the Whisper-generated input transcriptions; currently I ask the Realtime API for its own transcription of what it heard via a tool call (sketched at the end of this post), and I'd like the transcriptions the Realtime API itself understands to be more accurate.
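
For anyone else hitting the background-noise hallucinations, here is a minimal sketch of the partial workaround I mean: making the built-in server VAD less trigger-happy via `session.update`. It assumes a raw WebSocket connection using the `ws` package; the model name and the threshold/silence values are illustrative, so tune them for your own noise floor.

```ts
import WebSocket from "ws";

// Endpoint and model name as documented at the time of writing; adjust as needed.
const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";

const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", () => {
  // Make server VAD less sensitive: a higher energy threshold and a longer
  // silence window before the turn is considered finished, so background
  // noise is less likely to trigger a response when nobody spoke.
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        turn_detection: {
          type: "server_vad",
          threshold: 0.8,           // default is lower; illustrative value
          prefix_padding_ms: 300,
          silence_duration_ms: 800, // illustrative value
        },
      },
    })
  );
});
```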
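
And here is a rough sketch of the tool-call transcription workaround from the last bullet, continuing with the `ws` connection from the sketch above. The `report_transcription` tool name, the instructions text, and the event handling are my own illustration of the idea, not an official pattern.

```ts
ws.on("open", () => {
  // Ask the model to report back what it heard via a (hypothetical) tool.
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        instructions:
          "After each user turn, call report_transcription with the exact words you heard, then answer normally.",
        tools: [
          {
            type: "function",
            name: "report_transcription", // name is my own, not part of the API
            description:
              "Report the verbatim transcription of the user's last utterance.",
            parameters: {
              type: "object",
              properties: { transcript: { type: "string" } },
              required: ["transcript"],
            },
          },
        ],
      },
    })
  );
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // The full tool-call arguments arrive in this event once streaming finishes.
  // With several tools registered you would also match on the call's name/call_id.
  if (event.type === "response.function_call_arguments.done") {
    const { transcript } = JSON.parse(event.arguments);
    console.log("Transcript as understood by the Realtime API:", transcript);
  }
});
```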