How do ElevenLabs' and Deepgram's realtime voice agents work as well as the OpenAI Realtime API?

Hi OpenAI community,

I’ve recently been comparing OpenAI’s Realtime API against other voice AI products’ realtime solutions, such as Deepgram’s Voice Agent and ElevenLabs’ Conversational AI. To my surprise, they all respond almost as quickly and feel nearly as instantaneous as the OpenAI Realtime API.

Under the hood, these two products use GPT-4o mini or Claude 3.5 Haiku as their LLM, yet they somehow achieve sub-second latency from the end of speech to the first byte of the voice response, and it feels very natural. I believe they still follow the STT → LLM → TTS pattern, but somehow they make the whole loop extremely fast.
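One pattern I’ve seen suggested (I can’t confirm it’s what these vendors actually do) is to stream LLM tokens straight into TTS at sentence boundaries instead of waiting for the full completion. Here is a minimal simulated sketch of that idea; `stream_llm_tokens` and `tts_synthesize` are made-up stubs with fake latencies, not real client calls:

```python
# Toy sketch: overlap LLM decoding and TTS instead of running them
# strictly one after the other. The stubs simulate an LLM token stream
# and a TTS call; real clients would replace them.
import asyncio
import re
import time

async def stream_llm_tokens(prompt: str):
    # Stub: ~300 ms time-to-first-token, then ~30 ms per token,
    # roughly what a small hosted model might deliver.
    await asyncio.sleep(0.3)
    for token in "Sure, I can help with that. What would you like to know?".split():
        yield token + " "
        await asyncio.sleep(0.03)

async def tts_synthesize(text: str) -> bytes:
    # Stub: pretend TTS returns the first audio bytes ~200 ms after the call.
    await asyncio.sleep(0.2)
    return text.encode()  # stand-in for audio

async def respond(prompt: str) -> None:
    start = time.perf_counter()
    buffer = ""
    async for token in stream_llm_tokens(prompt):
        buffer += token
        # Flush to TTS at the first sentence boundary rather than waiting
        # for the whole LLM response; playback could start right here.
        if re.search(r"[.!?]\s*$", buffer):
            await tts_synthesize(buffer)
            print(f"audio ready at {time.perf_counter() - start:.2f}s for: {buffer!r}")
            buffer = ""
    if buffer:
        await tts_synthesize(buffer)

asyncio.run(respond("hello"))
```

In this simulation the first audio is ready around 0.7 s, even though the full completion takes well over a second, because the TTS cost only applies to the first sentence rather than the whole answer.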

When I implemented my own STT/LLM/TTS test app, the latency broke down roughly like this:

  1. Detecting end-of-speech silence: 500-1000 ms
  2. Sending the finalized transcript to GPT-4o mini and streaming back the first response chunk: 800 ms - 1 s
  3. TTS generating the first audio chunk: ~200 ms

Summed sequentially, that is roughly 1.5-2.2 s, so it looks impossible to get the latency under 2 s, yet these products respond almost instantly.
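Again just a guess, but steps 1 and 2 don’t have to run back-to-back: one could fire a speculative LLM request on the interim transcript while endpointing is still confirming silence, and discard it if the final transcript turns out different. A toy simulation with stubbed timings (all function names hypothetical):

```python
# Toy sketch: overlap end-of-speech detection (step 1) with LLM
# time-to-first-token (step 2) via a speculative request.
import asyncio
import time

async def llm_first_token(transcript: str) -> None:
    await asyncio.sleep(0.8)  # simulated time-to-first-token

async def sequential(interim: str, endpoint_delay: float) -> float:
    start = time.perf_counter()
    await asyncio.sleep(endpoint_delay)   # wait out end-of-speech silence
    await llm_first_token(interim)        # only then call the LLM
    return time.perf_counter() - start

async def speculative(interim: str, endpoint_delay: float) -> float:
    start = time.perf_counter()
    llm_task = asyncio.create_task(llm_first_token(interim))  # start LLM early
    await asyncio.sleep(endpoint_delay)   # endpointing runs concurrently
    # Assumed here: the final transcript matches the interim one, so the
    # speculative response is kept; otherwise cancel llm_task and retry.
    await llm_task
    return time.perf_counter() - start

async def main():
    seq = await sequential("what's the weather", 0.7)
    spec = await speculative("what's the weather", 0.7)
    print(f"sequential: {seq:.2f}s, speculative: {spec:.2f}s")

asyncio.run(main())
```

With these toy numbers, the endpointing wait and LLM latency collapse from ~1.5 s combined to ~0.8 s, the longer of the two, which would go a long way toward the sub-second responses I’m seeing.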

I don’t know how these companies achieve this. I’d love any knowledge sharing, papers, or open-source demos that could help me understand the approach.

Thanks!