Huge problems with TTS API

I tried all of the voices on the tts-1-hd model, and there is a high error rate when the input is a single English word rather than an entire sentence.

  1. There will either be total silence, or
  2. The word will be pronounced so quickly it can barely be understood
  3. The end of the word is cut off (seems to be more frequent with ending punctuation - period)
  4. On some short words (4 characters or fewer), an “A” followed by a hyphen is pronounced “ah” or omitted entirely
  5. I attempted to correct #4 by replacing “A-” in the input with "Ayy " and the accent for that word shifts from American English to Australian. Adding a period after the single word also causes the pronunciation of “Ayy” to become incorrect
  6. Setting the speech rate less than 1 (I tried 0.85) produces an uncanny echo-ish sound which makes the voice sound 100% robotic.

On #1: Mitigated by prefixing the input with "[pause] " (note the trailing space; omitting it sometimes causes problems).
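In case it helps anyone, here is a minimal Python sketch of that prefix workaround. `prefix_short_input` is a hypothetical helper name of my own, not part of any SDK, and treating "no internal whitespace" as "single word" is an assumption. It only prepares the text; you would pass the result as the input of your speech request.

```python
def prefix_short_input(text: str, prefix: str = "[pause] ") -> str:
    """Prepend a bracketed cue to single-word inputs; leave longer inputs alone.

    The default prefix keeps the trailing space mentioned above.
    """
    stripped = text.strip()
    # Assumption: anything without internal whitespace counts as a single word.
    if stripped and len(stripped.split()) == 1:
        return prefix + stripped
    return stripped
```

Whether the prefix is still needed for two- or three-word inputs is something I have not checked, so the threshold may need tuning.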

On #6: Mitigated by producing audio at speed=1 and then slowing it down with ffmpeg's atempo filter.
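A rough sketch of that post-processing step, assuming ffmpeg is installed and on your PATH. The function names and file names are placeholders of mine, and the 0.5 to 2.0 bound is a deliberately conservative range for a single atempo instance (newer ffmpeg builds may accept more).

```python
import subprocess


def atempo_command(src: str, dst: str, tempo: float = 0.85) -> list[str]:
    """Build an ffmpeg command that changes tempo without changing pitch."""
    # Conservative bound for one atempo filter instance (an assumption;
    # check your ffmpeg version if you need values outside this range).
    if not 0.5 <= tempo <= 2.0:
        raise ValueError("tempo should be between 0.5 and 2.0")
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={tempo}", dst]


def slow_down(src: str, dst: str, tempo: float = 0.85) -> None:
    """Run ffmpeg on a file that was generated at speed=1."""
    subprocess.run(atempo_command(src, dst, tempo), check=True)
```

For example, `slow_down("word.mp3", "word_slow.mp3", 0.85)` would write a slower copy next to the original.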

It looks like different locales are trained into the same model. Maybe train them independently and allow a locale to be specified as a request parameter?


The first question:
Have you found this only on tts-1-hd? I’ve barely found any difference between the two models. One perhaps may be larger, or have lower word error rate in synthesizing audio, but I couldn’t find what that would be.

For #6 - it’s clear, by recognizing the artifacts, that for the playback speed, they are just post-processing the audio with a time-stretch algorithm that uses time slicing to maintain the pitch.

Different inputs:
What improvement have you had with specifying the language parameter? How about including a prompt leading up to the word when your code detects a one-word input? I’m thinking something like “Today, our word of the day is”.
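Something along these lines is what I have in mind, as an untested idea rather than a known fix. The helper name `with_carrier_phrase` is made up for illustration, and "one word" here just means the input has no internal whitespace.

```python
CARRIER = "Today, our word of the day is"


def with_carrier_phrase(text: str, carrier: str = CARRIER) -> str:
    """Wrap a one-word input in a lead-in sentence; pass longer inputs through."""
    words = text.split()
    # Only rewrite inputs that consist of exactly one whitespace-delimited token.
    if len(words) == 1:
        return f"{carrier} {words[0]}"
    return text
```

The obvious cost is that the lead-in ends up in the generated audio too, so you would have to trim it off afterwards if you only want the word.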

The symptom was previously explored on particular words the AI couldn’t say. Can you share some example words?

You have a good idea there of surrounding it with “stage directions” in brackets. You can also try “…” (ellipsis markers) on multiple lines surrounding the speech.

The “fable” API voice seems to be non-American (and easily re-gendered), so you can see how you can build on that. To me, it takes more than a word to attune yourself to its output.

(also, you might pad the output wav audio with some digital silence, so you can be sure you are not experiencing a problem with the playback device needing initialization time, or the program terminating before the buffer is rendered.)
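A sketch of the padding idea using only Python's standard `wave` module. It assumes you already have the audio as uncompressed PCM WAV bytes; `pad_wav_with_silence` and the 250 ms defaults are my own choices, not anything the API provides.

```python
import io
import wave


def pad_wav_with_silence(wav_bytes: bytes, lead_ms: int = 250, tail_ms: int = 250) -> bytes:
    """Return a copy of a PCM WAV with digital silence added before and after."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    frame_size = params.sampwidth * params.nchannels
    # 8-bit PCM WAV is unsigned (silence is 0x80); wider samples are signed (0x00).
    silent_byte = b"\x80" if params.sampwidth == 1 else b"\x00"

    def silence(ms: int) -> bytes:
        n_frames = params.framerate * ms // 1000
        return silent_byte * (frame_size * n_frames)

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(silence(lead_ms) + frames + silence(tail_ms))
    return out.getvalue()
```

If the word is still clipped or silent after padding, the playback device is probably not the cause.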

I encountered a similar issue today. The generated audio is silent when I request it using single-word inputs such as ‘ChatGPT’, ‘GPT’, or ‘TTS’. However, if the input is a sentence, the generated audio is correct. For example: ‘The ChatGPT is very good.’

This issue can be reproduced with all six voices and both the tts-1 and tts-1-hd models.

I’ve had a similar problem. Discussion and a partial workaround here.

For us, the problem is speed.

Response time sometimes goes from 600 ms to 3 seconds for no apparent reason.