ElevenLabs seems to be much faster than OpenAI at text-to-speech (TTS)

I ran some tests to compare the TTS latency of ElevenLabs and OpenAI on relatively short text. ElevenLabs seems to be about four times faster than OpenAI (2.38 vs. 9.70 seconds average generation time).

I ran three iterations of each of these test cases:

  • short: 10 words
  • medium: 24 words
  • long: 64 words

TTS Latency Test Results for OpenAI:

  • Number of tests: 9

  • Average generation time: 9.70 seconds

  • Average audio duration: 10.51 seconds

  • Average processing speed: 5.28 words/second

  • Average speaking speed: 3.36 words/second

TTS Latency Test Results for ElevenLabs:

  • Number of tests: 9

  • Average generation time: 2.38 seconds

  • Average audio duration: 9.84 seconds

  • Average processing speed: 13.17 words/second

  • Average speaking speed: 3.44 words/second
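
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how the four metrics above could be computed. It is not the script that produced these numbers. The `synthesize` callable is hypothetical: it is assumed to wrap a provider's TTS call and return the duration of the generated audio in seconds. The sketch also assumes the "words/second" figures are averages of per-test ratios, which the post does not state.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def measure(
    synthesize: Callable[[str], float],
    texts: List[str],
    iterations: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> List[Dict[str, float]]:
    """Time each text `iterations` times.

    `synthesize` is a hypothetical wrapper around a TTS provider call;
    it must return the generated audio's duration in seconds.
    """
    rows = []
    for text in texts:
        words = len(text.split())
        for _ in range(iterations):
            start = clock()
            audio_s = synthesize(text)
            generation_s = clock() - start
            rows.append(
                {
                    "words": words,
                    "generation_s": generation_s,
                    "audio_s": audio_s,
                    # words generated per second of wall-clock time
                    "processing_wps": words / generation_s,
                    # words spoken per second of audio
                    "speaking_wps": words / audio_s,
                }
            )
    return rows


def summarize(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all tests (mean of per-test values)."""
    return {
        "tests": len(rows),
        "avg_generation_s": mean(r["generation_s"] for r in rows),
        "avg_audio_s": mean(r["audio_s"] for r in rows),
        "avg_processing_wps": mean(r["processing_wps"] for r in rows),
        "avg_speaking_wps": mean(r["speaking_wps"] for r in rows),
    }
```

To use it, pass one `synthesize` wrapper per provider together with the same three texts, then compare the two `summarize` results. Note that wall-clock time measured this way includes network round trips, so results will vary with location and connection.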

The title says “speech to text” but I guess you meant “text to speech” 🙂
I don’t know about your setup, but generation time depends on many things, for example the exact model(s) you’re using. For OpenAI TTS, you could be using the HD model (which takes longer). ElevenLabs also has two model families, standard and turbo, each with different models.

I always recommend the “Text to Speech Models and Providers Leaderboard” on Artificial Analysis for a detailed comparison!


ElevenLabs is surely in the vanguard in terms of quality, but I personally have higher expectations for OpenAI because it is more cost-effective and offers a greater variety of services in its API.
That said, there is a lot of room for improvement in OpenAI's TTS services, and I hope they focus on that. One bottleneck in developing interactive services is that no matter how good the AI response is, people will still judge it by the quality of the voice or by how accurately it transcribes speech audio.