I ran some tests comparing the TTS latency of ElevenLabs and OpenAI on relatively short text. ElevenLabs seems to be about four times faster than OpenAI.
The title says “speech to text”, but I guess you meant “text to speech”.
I don’t know about your setup, but generation time depends on so many things, for example the exact model(s) you’re using. For OpenAI TTS, that could be the HD model (which takes longer). ElevenLabs also has two model families, standard and turbo, each with different models.
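To make a comparison like this less dependent on a single lucky or unlucky run, here is a minimal sketch of a timing harness. It assumes you wrap each provider/model combination in your own function that takes text and returns audio bytes; the `benchmark` name and the idea of passing in callables are mine for illustration, not anything from either vendor's SDK.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(
    synthesizers: Dict[str, Callable[[str], bytes]],
    text: str,
    runs: int = 5,
) -> Dict[str, float]:
    """Return the median wall-clock seconds each synthesizer needs for the same text.

    `synthesizers` maps a label (e.g. provider plus model name) to a
    user-supplied function that sends `text` to that service and returns
    the complete audio. Those functions are hypothetical placeholders here.
    """
    results: Dict[str, float] = {}
    for name, synth in synthesizers.items():
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            synth(text)  # blocks until the full audio has been received
            timings.append(time.perf_counter() - start)
        # The median is less sensitive to one slow outlier than the mean.
        results[name] = statistics.median(timings)
    return results
```

Using the same text, the same number of runs, and one entry per model keeps the numbers comparable. Note that this measures total generation time only, not time to first audio.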
ElevenLabs is surely in the vanguard in terms of quality, but I personally have higher expectations for OpenAI because it is more cost-effective and offers a greater variety of services in its API.
That said, there is a lot of room for improvement in OpenAI’s TTS services, and I hope they focus on that. One bottleneck in developing interactive services is that no matter how good the AI response is, people will still judge it by the quality of the voice or by how accurately it transcribes speech.
I found Lemonfox.ai to be a great alternative to ElevenLabs and OpenAI TTS. It’s quite fast, offers an OpenAI- and ElevenLabs-compatible API, and is much cheaper than both of them.
Latency means delay.
“Average generation time…” is completely irrelevant.
Nobody cares how long it takes to convert text to audio. The only thing that matters is how long it takes between sending the first word and getting back the start of the audio. You know, the latency …
They advertise “2s to 4s” as target times, which is really weird. Even the Google non-streaming API responds in 400ms or less, up to 10 times faster, and that’s not even their streaming endpoint: you get the audio for the entire sentence back in one go, before you can play it, 400ms after you sent the text…
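To make the difference between latency and generation time concrete, here is a rough sketch that times both on a chunked audio stream. It assumes your client exposes the response as an iterable of byte chunks; `measure_stream` and `fake_tts_stream` are made-up names, and the fake stream only stands in for a real streaming TTS call.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float]:
    """Return (seconds to first audio chunk, seconds to full audio).

    `start_stream` is a user-supplied function that sends the request and
    returns an iterable of audio chunks. The clock starts before it is
    called, so request setup counts toward the latency.
    """
    start = time.perf_counter()
    first_chunk_at = None
    for chunk in start_stream():
        if first_chunk_at is None and chunk:
            # Latency: the moment playback could begin.
            first_chunk_at = time.perf_counter() - start
    total = time.perf_counter() - start
    if first_chunk_at is None:
        raise ValueError("stream produced no audio")
    return first_chunk_at, total


def fake_tts_stream(chunks: int = 5, delay: float = 0.02) -> Iterator[bytes]:
    """Stand-in for a real streaming endpoint: emits silence with a fixed delay."""
    for _ in range(chunks):
        time.sleep(delay)
        yield b"\x00" * 1024


if __name__ == "__main__":
    latency, total = measure_stream(lambda: fake_tts_stream())
    print(f"first audio after {latency * 1000:.0f} ms, full audio after {total * 1000:.0f} ms")
```

With a non-streaming endpoint the two numbers are the same, since the whole sentence arrives in one response. With streaming, the first number is the one a listener actually notices.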