ElevenLabs seems to be much faster than OpenAI at text to speech (TTS)

I ran some tests to compare the TTS latency of ElevenLabs and OpenAI when processing relatively short text. ElevenLabs seems to be about four times faster than OpenAI.

Three iterations of each test case:

  • short: 10 words
  • medium: 24 words
  • long: 64 words

TTS Latency Test Results for OpenAI:

  • Number of tests: 9

  • Average generation time: 9.70 seconds

  • Average audio duration: 10.51 seconds

  • Average processing speed: 5.28 words/second

  • Average speaking speed: 3.36 words/second

TTS Latency Test Results for ElevenLabs:

  • Number of tests: 9

  • Average generation time: 2.38 seconds

  • Average audio duration: 9.84 seconds

  • Average processing speed: 13.17 words/second

  • Average speaking speed: 3.44 words/second
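For anyone who wants to reproduce numbers like these, here is a minimal timing harness. The `synthesize` callable is a placeholder: you would wrap it around whichever provider SDK you test (for OpenAI that might be a thin wrapper over `client.audio.speech.create`, for ElevenLabs their own SDK) - the names and structure here are my assumptions, not either vendor's benchmark code:

```python
import time
import statistics

def benchmark_tts(synthesize, texts, iterations=3):
    """Time a blocking TTS call end-to-end and report averages.

    `synthesize` is any callable taking a text string and returning
    the full audio bytes (a thin wrapper around a provider SDK).
    """
    gen_times, speeds = [], []
    for text in texts:
        words = len(text.split())
        for _ in range(iterations):
            start = time.perf_counter()
            synthesize(text)  # blocks until the full clip is back
            elapsed = time.perf_counter() - start
            gen_times.append(elapsed)
            speeds.append(words / elapsed)
    return {
        "tests": len(gen_times),
        "avg_generation_s": statistics.mean(gen_times),
        "avg_words_per_s": statistics.mean(speeds),
    }
```

With three texts and three iterations each, this yields the same "Number of tests: 9" shape as the results above.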

The title says “speech to text” but I guess you meant “text to speech” :slight_smile:
I don’t know about your setup, but that generation time depends on so many things - for example, the exact model(s) you’re using. For OpenAI TTS, that could also be the HD model (which takes longer). ElevenLabs also has two model families, standard and turbo, each with different models.

I always recommend the Text to Speech Models and Providers Leaderboard by Artificial Analysis for a detailed comparison!

ElevenLabs is surely in the vanguard in terms of quality, but I personally have greater expectations for OpenAI because it is more cost-effective and offers a greater variety of services in its API.
That said, there is a lot of room for improvement in OpenAI’s TTS services, and I hope they focus on that, because one bottleneck in developing interactive services is that no matter how good the AI response is, people will still judge it by the quality of the voice or its ability to correctly transcribe speech audio.

I found Lemonfox.ai to be a great alternative to ElevenLabs and OpenAI TTS. It’s quite fast, offers OpenAI- and ElevenLabs-compatible APIs, and is much cheaper than both of them.

Interesting - based in Germany! :+1:

Latency means delay.
“Average generation time…” is completely irrelevant here.
Nobody cares how long it takes to convert the full text to audio; the only thing that matters is how long it takes between sending the first word and getting back the start of the audio. You know - the latency…
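That distinction - full generation time vs. time to first audio - is easy to measure separately if the endpoint streams. A minimal sketch, where `stream_chunks` is a hypothetical zero-argument wrapper around whatever streaming TTS response you are testing:

```python
import time

def time_to_first_audio(stream_chunks):
    """Measure latency as time until the FIRST audio chunk arrives,
    separately from the time until the full clip is done.

    `stream_chunks` is a callable returning an iterator of audio
    byte chunks (e.g. a provider's streaming TTS response body).
    """
    start = time.perf_counter()
    chunks = stream_chunks()
    first = next(chunks)  # blocks until the first audio bytes land
    first_chunk_latency = time.perf_counter() - start
    total_bytes = len(first) + sum(len(c) for c in chunks)  # drain rest
    total_time = time.perf_counter() - start
    return first_chunk_latency, total_time, total_bytes
```

For an interactive voice app, the first number is the one users feel; the second is what the benchmark above was measuring.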

They advertise “2s to 4s” as target times, which is really weird. Even Google’s non-streaming API returns in 400 ms or less - 10 times faster - and that’s not even their streaming endpoint: you get the entire sentence’s audio back in one go, 400 ms after you sent the text, before you could even speak it.

And you would get poorer transcription than from a language model trained with long context over the entire passage to come.

I picked up a pair (of apples)
I picked up a pear (and some apples)
I picked up au pair (Jenny from the agency)