Hi OpenAI community,
I’ve recently been comparing OpenAI’s Realtime API against other voice AI products’ realtime solutions, such as Deepgram’s Voice Agent and ElevenLabs’ Conversational AI. To my surprise, they all respond almost as quickly and feel almost as instantaneous as the OpenAI Realtime API.
Under the hood, these two products use GPT-4o mini or Claude 3.5 Haiku as their LLM, yet they achieve sub-second latency from end of speech to the first byte of the voice response, and it feels completely natural. I believe they still run an STT → LLM → TTS pipeline, but somehow they make the whole loop extremely fast.
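My best guess is that everything is streamed and overlapped: the LLM output is cut at sentence boundaries and handed to TTS immediately, so audio starts after the first sentence rather than after the full completion. Here’s a minimal sketch of that idea using the OpenAI Python SDK for the streaming LLM call; `synthesize_and_play` is a placeholder for whatever streaming TTS client you use, not a real API:

```python
# Sketch: hand each completed sentence to TTS while the LLM is still
# generating, so playback starts at the first sentence boundary.
from openai import OpenAI

client = OpenAI()

def synthesize_and_play(text: str) -> None:
    """Placeholder for a streaming TTS call; swap in your own engine."""
    print(f"[TTS] {text}")

def respond(transcript: str) -> None:
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript}],
        stream=True,  # tokens arrive as they are generated
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush at each sentence boundary instead of waiting for the
        # whole completion; this is where the perceived latency drops.
        if buffer.rstrip().endswith((".", "?", "!")):
            synthesize_and_play(buffer.strip())
            buffer = ""
    if buffer.strip():
        synthesize_and_play(buffer.strip())  # flush any trailing text
```

With this overlap, time-to-first-audio is roughly LLM time-to-first-token plus one sentence of generation plus TTS first-byte, not the full completion time.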
When I implemented my own STT/LLM/TTS test app, the latency breakdown was roughly:
- Detecting end-of-speech silence: 500-1000 ms (see the VAD sketch below)
- Sending the finalized transcript to GPT-4o mini and streaming back the first response chunk: 800 ms-1 s
- TTS generating the first chunk of audio: ~200 ms
Sequentially those stages sum to roughly 1.5-2.2 s, so it looks impossible to get anywhere near the instantaneous responses these products deliver.
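On the first number, my suspicion is that these vendors run much more aggressive endpointing than my ~500-1000 ms of trailing silence. Here’s a minimal sketch using webrtcvad; the 300 ms silence threshold and the frame iterator are my own assumptions, not anything the vendors have documented:

```python
# Sketch: endpoint after ~300 ms of trailing silence using webrtcvad.
import webrtcvad

SAMPLE_RATE = 16000  # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 20        # frames must be 10, 20, or 30 ms long
SILENCE_MS = 300     # assumed endpoint threshold (my guess)

def wait_for_endpoint(frames) -> None:
    """Consume 20 ms 16-bit mono PCM frames; return once SILENCE_MS of silence follows speech."""
    vad = webrtcvad.Vad(3)  # aggressiveness 0-3; 3 = most aggressive
    silence = 0
    heard_speech = False
    for frame in frames:    # `frames`: any iterator of raw PCM byte chunks
        if vad.is_speech(frame, SAMPLE_RATE):
            heard_speech = True
            silence = 0
        elif heard_speech:
            silence += FRAME_MS
            if silence >= SILENCE_MS:
                return      # end of utterance detected
```

Dropping the endpoint wait from ~1 s to ~300 ms would remove most of my first line item on its own, at the cost of occasionally clipping slow speakers.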
I don’t know how these companies achieve this. I’d love some knowledge sharing: papers, open-source demos, anything that helps me understand the idea.
Sources:
- Revolutionizing Voice Interaction: Deepgram’s AI-Powered Real-Time Voice Agent API (Brain Titan, Medium)
- Introducing Conversational AI (ElevenLabs)
Thanks!