TTS API Speed and Quality Issues


Hello. I’ve adopted the new TTS API in my text-to-speech apps, which let people listen to websites and PDFs. The apps now offer OpenAI’s voices alongside AWS Polly, Azure, and Google Cloud, and users can choose between them.

The OpenAI voices are excellent in terms of realism, but there are two problems. First, they are slow: response times definitely lag behind competitors. Second, they sometimes skip phrases and, on occasion, entire paragraphs. For example, when I submit a single word, the API will sometimes return silence. The problem gets worse with non-English languages. I have a French user who reports that entire paragraphs are skipped regularly.

Unlike the likes of ElevenLabs and Murf, OpenAI structured its pricing similarly to its big tech competitors like Azure and Polly. I am happy about that because it fit perfectly with my existing model, and adding the voices to my apps was a no-brainer. However, the quality is a concern. I am willing to put up with some level of “early adopter tax”, but I have yet to see any improvements since adopting the new voices at launch.

Is work being done on improving the response times and addressing the silences?

Also, this is a bit of an aside, but it would be awesome if OpenAI could provide speech marks. I offer real-time word highlighting in my apps, but with OpenAI’s voices I have to estimate the word timings. It would be great if it were more accurate.
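For readers wondering what estimating the highlighting might look like without speech marks, here is a minimal sketch. It is not the poster's actual method: it simply assumes speaking time is proportional to word length, and the function name `estimate_word_timings` and the returned field names are made up for illustration. The audio duration has to be measured from the returned audio by other means.

```python
import re


def estimate_word_timings(text, audio_duration_s):
    """Spread the audio duration across words in proportion to their length.

    Assumption (hypothetical heuristic): each word takes time proportional
    to its character count plus one for the gap that follows it.
    """
    words = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]
    if not words:
        return []

    weights = [len(word) + 1 for word, _ in words]
    total_weight = sum(weights)

    timings = []
    elapsed = 0.0
    for (word, offset), weight in zip(words, weights):
        span = audio_duration_s * weight / total_weight
        timings.append(
            {
                "word": word,
                "char_offset": offset,  # position of the word in the input text
                "start_s": elapsed,
                "end_s": elapsed + span,
            }
        )
        elapsed += span
    return timings
```

An estimate like this drifts whenever the voice pauses or speeds up, which is why real speech marks from the API would be more accurate.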

If OpenAI can succeed at addressing the speed and quality issues (and add speech marks), they’d definitely be at the top in terms of TTS APIs out there.

I’m not sure what pricing you’re looking at?

  • OpenAI TTS HD is $0.30/1k
  • EL (HD) is $0.18/1k.

There are no official comments on this, but I think it’s safe to assume the answer is yes regarding response times.

In terms of silence: maybe. Silence is really hard to accomplish using AI TTS. Other TTS providers recommend splitting the text yourself and using SSML, although SSML can distort the voice.
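Splitting the text yourself could look roughly like the sketch below. It is only an illustration: the helper name `split_into_chunks` and the `max_chars` default are assumptions, and the sentence splitting is a simple punctuation heuristic rather than anything a provider prescribes.

```python
import re


def split_into_chunks(text, max_chars=1000):
    """Split text into chunks that respect paragraph and sentence boundaries.

    max_chars is a hypothetical budget per request; a single sentence longer
    than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        current = ""
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}" if current else sentence
        if current:
            chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request, so a dropped or bad response only affects one chunk and can be retried on its own.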

They have so many branches now it’s hard to know what they are focusing on and when it will be updated. In the meantime I have been switching from TTS to STS (Speech-To-Speech), which inherently includes tone, expression, and pausing.

It’s how they tier things by Free/Starter/Creator/Pro/Whatever. Each tier has its own character limits. I don’t want character limits because I don’t know how many characters my customers will consume over a given period. I like the model pioneered by AWS Polly (I think they were first), where you pay based on what you use and that’s it. No subscription plans and the like.

I wasn’t clear. What I meant by silences was how the API sometimes returns silence as a response to single words and sometimes to entire paragraphs.
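One possible way to guard against this on the client side is to check whether the returned audio is essentially silent before playing it, and retry if so. The sketch below is an assumption-heavy illustration: it expects raw 16-bit signed little-endian PCM bytes (so the audio must be requested or decoded in that form), and the function name and both thresholds are made up and would need tuning.

```python
import array
import sys


def is_mostly_silent(pcm_bytes, amplitude_threshold=500, min_loud_fraction=0.01):
    """Return True if almost no samples exceed the amplitude threshold.

    Assumes pcm_bytes is raw 16-bit signed little-endian PCM. Both
    thresholds are hypothetical starting points, not tuned values.
    """
    samples = array.array("h")
    # Drop a trailing odd byte so the buffer is a whole number of samples.
    samples.frombytes(pcm_bytes[: len(pcm_bytes) // 2 * 2])
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian
    if not samples:
        return True
    loud = sum(1 for s in samples if abs(s) > amplitude_threshold)
    return loud / len(samples) < min_loud_fraction
```

A caller could resubmit the same text when this returns True, although that adds latency and cost and does not address the underlying issue.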

Gotcha. Yeah, the subscription does suck if the characters aren’t used.

I also run an interface for EL and just figured that if my monthly subscription covers N tokens per month, it works out. There is also a slight discount on the subscribed tokens.

Ah. Sorry, I misread that. It would be nice if we could fine-tune or create models like with EL. I’m sure OpenAI knows these things and maybe will eventually release them. Or it could be that they released these TTS models because 🤷 they made them for ChatGPT and might as well also offer them through the API.

TTS acts weird when there are characters like ’ in the text. If possible, get rid of such characters.
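A minimal sketch of that kind of cleanup, replacing typographic characters with plain ASCII equivalents before sending the text. The character list and the name `normalize_for_tts` are assumptions for illustration; which characters actually cause trouble is not documented in this thread.

```python
# Hypothetical mapping of typographic characters to plain ASCII equivalents.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / curly apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

_TRANSLATION_TABLE = str.maketrans(REPLACEMENTS)


def normalize_for_tts(text):
    """Replace typographic punctuation with ASCII before sending text to TTS."""
    return text.translate(_TRANSLATION_TABLE)
```

Replacing the curly apostrophe with a straight one, rather than deleting it, keeps contractions such as "it's" intact.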

It’s not that simple. There are other cases as well (such as single words as I described).