How do ElevenLabs' and Deepgram's realtime voice agents work as well as the OpenAI Realtime API?

Hi OpenAI community,

I’ve recently been comparing OpenAI’s Realtime API against other voice AI products’ realtime solutions, such as Deepgram’s Voice Agent and ElevenLabs’ Conversational AI. To my surprise, they all respond almost as quickly and instantaneously as the OpenAI Realtime API.

Under the hood, these two products use GPT-4o-mini or Claude 3.5 Haiku as their LLM, yet they achieve sub-second latency from end of speech to the first byte of the voice response, and it feels very natural. I believe they are still running an STT/LLM/TTS pipeline under the hood, but somehow they make the whole loop extremely fast.
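
My best guess at part of the trick is overlapping the stages: stream tokens out of the LLM and hand the first complete sentence to TTS immediately, instead of waiting for the full reply. A minimal sketch of that idea with the OpenAI Python SDK (the model names, the sentence-splitting regex, and the output file are just placeholders I picked):

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def speak_first_sentence(user_text: str) -> str:
    """Stream the LLM reply and hand the first complete sentence to TTS right away."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any fast, streamable chat model works here
        messages=[{"role": "user", "content": user_text}],
        stream=True,
    )

    sentence = ""
    for chunk in stream:
        if not chunk.choices:
            continue
        sentence += chunk.choices[0].delta.content or ""
        # Naive sentence-boundary check; real agents presumably use smarter chunking.
        if re.search(r"[.!?]\s*$", sentence):
            break

    # Start synthesizing the first sentence right away instead of waiting for the
    # full reply (a real agent would keep consuming the LLM stream concurrently).
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=sentence,
    ) as speech:
        speech.stream_to_file("first_sentence.mp3")
    return sentence
```

If the first audio comes from just the opening sentence, the perceived latency is roughly LLM time-to-first-sentence plus TTS TTFB, not the full generation time.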

When I implemented my own STT/LLM/TTS test app, the latency broke down roughly like this:

  1. Detecting end-of-speech silence: 500-1000 ms
  2. Sending the finalized STT transcript to GPT-4o-mini and getting the first chunk of the response streamed back: 800 ms - 1 s
  3. TTS generating the first voice response: 200 ms

So it looks impossible to reduce the latency to under 2 s, yet they manage to respond almost instantaneously. (One idea for cutting step 1 is sketched below.)
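
The biggest lever I’ve found on my side is step 1: waiting for 500-1000 ms of trailing silence is very conservative. Here’s a rough sketch of frame-level endpointing with the webrtcvad package and a ~300 ms silence window; the 30 ms PCM frame source is a hypothetical stand-in for a real microphone callback:

```python
import webrtcvad

SAMPLE_RATE = 16000   # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30         # webrtcvad accepts 10/20/30 ms frames
SILENCE_END_MS = 300  # assumption: much shorter than my original 500-1000 ms wait

def detect_end_of_speech(frames) -> bool:
    """Return True once SILENCE_END_MS of trailing silence follows detected speech.

    `frames` is any iterable of 30 ms, 16-bit mono PCM byte frames
    (e.g. from a microphone callback) -- a hypothetical stand-in here.
    """
    vad = webrtcvad.Vad(3)  # aggressiveness 0-3
    silent_ms = 0
    heard_speech = False

    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            heard_speech = True
            silent_ms = 0
        elif heard_speech:
            silent_ms += FRAME_MS
            if silent_ms >= SILENCE_END_MS:
                return True  # endpoint: hand the finalized transcript to the LLM now
    return False
```

Even with a ~300 ms endpoint, the LLM time-to-first-token still dominates for me, so I assume the hosted agents do more than just aggressive endpointing (interim transcripts, overlapping stages, colocated models, etc.).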

I don’t know how these companies achieve it. I’d love any knowledge sharing, papers, or open-source demos that can help me understand the approach.

Thanks!

Hey, I’m very curious about this as well… If you get any insights, please share!

While I don’t work at any of these companies, my guess is that they run inference on TPUs/LPUs in-house, which are very fast and pretty much “made for AI”.
Hosting everything in one environment also removes a lot of the network request latency between the pipeline stages.

Cheers. 🤗

Similarly interested in insights here on how they’re this fast despite the ‘old’ STT->LLM->TTS flow.

Also, allanjsx, which provider(s) did you get 200 ms TTS TTFB with? I’m seeing about 7x that with OpenAI’s non-HD TTS, occasionally much longer (i.e. high variance).
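
For reference, this is roughly how I’m measuring TTS TTFB, assuming the openai Python SDK’s streaming-response helper (the model and voice are just what I happened to test with):

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def tts_ttfb(text: str) -> float:
    """Return seconds from request start to the first streamed audio chunk."""
    start = time.perf_counter()
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",   # non-HD model; "tts-1-hd" has been slower to first byte for me
        voice="alloy",
        input=text,
    ) as response:
        for _ in response.iter_bytes():
            return time.perf_counter() - start  # first chunk received
    return float("inf")

print(f"TTFB: {tts_ttfb('Hello there, how can I help?'):.3f}s")
```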