Huge problems with TTS API

dev5 · April 27, 2024, 3:33am

I tried all of the voices on the tts-1-hd model, and there is a high error rate pronouncing single English words rather than entire sentences.

1. There will be total silence, or
2. The word will be pronounced so quickly that it can barely be understood.
3. The end of the word is cut off (this seems more frequent when the input ends with punctuation, such as a period).
4. On some short words (4 characters or fewer), it will pronounce an A followed by a hyphen as “ah” or omit the A entirely.
5. I attempted to correct #4 by replacing “A-” in the input with "Ayy ", and the accent for that word shifts from American English to Australian. Adding a period after the single word also causes the pronunciation of “Ayy” to become incorrect.
6. Setting the speech rate below 1 (I tried 0.85) produces an uncanny, echo-like sound that makes the voice sound 100% robotic.

On #1: Relieved by prefixing the input with "[pause] " (note the trailing space; leaving it out sometimes seems to cause problems).
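A minimal sketch of that prefix workaround, in case it helps anyone reproduce it. The helper name `add_pause_prefix` and the "single word means one whitespace-separated token" rule are my own assumptions, not anything from the API:

```python
def add_pause_prefix(text: str, prefix: str = "[pause] ") -> str:
    """Prefix single-word inputs with a bracketed cue followed by a space.

    Multi-word inputs are returned unchanged (apart from stripping
    surrounding whitespace), since the silence issue was reported for
    single words only.
    """
    stripped = text.strip()
    # Assumption: "single word" means exactly one whitespace-separated token.
    is_single_word = len(stripped.split()) == 1
    if is_single_word:
        return prefix + stripped
    return stripped
```

The returned string would then be sent as the `input` of the speech request instead of the bare word.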

On #6: Relieved by producing the audio at speed=1 and then using ffmpeg's atempo filter to slow it down.
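Roughly what the ffmpeg step looks like when scripted. This is a sketch, not the exact command I ran: the function names and file names are placeholders, it assumes `ffmpeg` is on the PATH, and the 0.5 to 2.0 bounds are a conservative range for a single `atempo` instance:

```python
import subprocess
from typing import List


def build_atempo_command(src: str, dst: str, tempo: float) -> List[str]:
    """Build an ffmpeg command that changes tempo without changing pitch."""
    # Conservative bounds for one atempo instance; values outside this
    # range may need chained filters or a newer ffmpeg build.
    if not 0.5 <= tempo <= 2.0:
        raise ValueError("tempo must be between 0.5 and 2.0")
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={tempo}", dst]


def slow_down(src: str, dst: str, tempo: float = 0.85) -> None:
    """Run ffmpeg on audio that was generated at speed=1."""
    subprocess.run(build_atempo_command(src, dst, tempo), check=True)
```

For example, `slow_down("word.mp3", "word_slow.mp3")` would write a copy at 0.85x tempo.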

It looks like different locales are trained into the same model. Maybe retrain them independently and allow a locale to be specified as a request parameter?

_j · April 27, 2024, 4:43am

The first question:
Have you found this only on tts-1-hd? I’ve barely found any difference between the two models. One perhaps may be larger, or have lower word error rate in synthesizing audio, but I couldn’t find what that would be.

For #6: judging by the artifacts, it’s clear that the speed setting is just post-processing the audio with a time-stretch algorithm that uses time slicing to maintain the pitch.

Different inputs:
What improvement have you had with specifying the language parameter? How about including a lead-in phrase before the word when your code detects a one-word input? I’m thinking something like “Today, our word of the day is”.
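Something like this is what I have in mind, as a rough sketch only. The function name `wrap_in_carrier` is made up for illustration, and I haven't verified that it fixes the problem:

```python
def wrap_in_carrier(text: str, carrier: str = "Today, our word of the day is") -> str:
    """Put a one-word input after a lead-in phrase; leave longer inputs alone."""
    stripped = text.strip()
    # Assumption: a one-word input is a single whitespace-separated token.
    if len(stripped.split()) == 1:
        return f"{carrier} {stripped}"
    return stripped
```

Keep in mind that the lead-in phrase is spoken too, so you would have to trim it from the resulting audio if you only want the word.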

The symptom was previously explored with particular words the AI couldn’t say. Can you give us some example words?

You have a good idea there of surrounding it with “stage directions” in brackets. You can also try … (ellipsis markers) on multiple lines surrounding the speech.

The “fable” API voice seems to be non-American (and easily re-gendered), so you can see how you can build on that. To me, it takes more than a word to attune yourself to its output.

(also, you might pad the output wav audio with some digital silence, so you can be sure you are not experiencing a problem with the playback device needing initialization time, or the program terminating before the buffer is rendered.)
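A sketch of the silence padding using only Python's standard `wave` module. It assumes you requested WAV output and got a well-formed PCM file with a correct header; the function name and the 250 ms defaults are arbitrary choices of mine:

```python
import io
import wave


def pad_wav_with_silence(wav_bytes: bytes, lead_ms: int = 250, tail_ms: int = 250) -> bytes:
    """Return a copy of a PCM WAV with digital silence before and after it."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    frame_size = params.nchannels * params.sampwidth
    # 8-bit PCM WAV is unsigned (silence is 0x80); wider samples are signed (0x00).
    silent_byte = b"\x80" if params.sampwidth == 1 else b"\x00"

    def silence(ms: int) -> bytes:
        n_frames = params.framerate * ms // 1000
        return silent_byte * (n_frames * frame_size)

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(silence(lead_ms) + frames + silence(tail_ms))
    return out.getvalue()
```

If the word is audible in the padded file but not in the original, the playback path is the more likely culprit than the synthesis.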

robin96986679 · April 28, 2024, 7:29am

I encountered a similar issue today. The generated audio is silent when I request it with single-word inputs such as ‘ChatGPT’, ‘GPT’, or ‘TTS’. However, if the input is a sentence, the generated audio is correct. For example: ‘The ChatGPT is very good.’

This issue can be reproduced with all six voices and both the tts-1 and tts-1-hd models.

peterhartree · May 27, 2024, 8:56am

I’ve had a similar problem. Discussion and a partial workaround here.

EduGPT · May 27, 2024, 1:44pm

For us, the issue is the speed.

Response time sometimes goes from 600 ms to 3 s for no apparent reason.
