I’m using the Text To Speech API (tts-1) and it’s working quite well. However, when I ask it to read out lists of items, it occasionally skips the number at the beginning of an item, and it often doesn’t pause between the number and the word after it.

For example:

Sure! Here are 5 fruits:

  1. Apple
  2. Banana
  3. Orange
  4. Strawberry
  5. Grape

is read aloud as:

Sure! Here are 5 fruits:

  1. Apple
  2. Banana
    3 Orange
    4 Strawberry-Grape

I have tried this with the voices Echo and Shimmer and it seems to happen almost every time you get the model to read out a list.

Are there any tips for making the model do a brief pause between the number and the first word?

Note: this forum doesn’t allow me to upload mp3 files, but this bug is fairly easy to replicate. Just get the model to read out a list of items produced by ChatGPT.


Pausing is traditionally hard with TTS.

Try adding some “…” or even “-” after each listed item.

Sure! Here are 5 fruits:

   1. Apple...
   2. Banana...
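If you are generating the text programmatically, the trailing "..." can be added before the text is sent to the TTS endpoint. This is only a rough sketch: `add_pauses` and the `LIST_ITEM` pattern are names made up for illustration, and it assumes list items look like "1. Apple" or "1) Apple", one per line.

```python
import re

# Hypothetical pattern: a numbered list item such as "  1. Apple" or "2) Banana".
LIST_ITEM = re.compile(r"^(\s*\d+[.)]\s+)(.*\S)\s*$")


def add_pauses(text: str, pause: str = "...") -> str:
    """Append a pause marker to every numbered list item in `text`."""
    out = []
    for line in text.splitlines():
        match = LIST_ITEM.match(line)
        # Skip items that already end with the pause marker.
        if match and not match.group(2).endswith(pause):
            line = f"{match.group(1)}{match.group(2)}{pause}"
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "Sure! Here are 5 fruits:\n\n  1. Apple\n  2. Banana"
    print(add_pauses(sample))
```

The result is what you would pass as the input text. Whether the pause is actually spoken still depends on the model, so treat this as something to experiment with rather than a guaranteed fix.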

This is from the ElevenLabs docs, but I believe it carries over (they have their own syntax for handling pauses now as well):

These options are inconsistent and might not always work. We recommend using the syntax above for consistency.

One trick that seems to provide the most consistent output - sans the above option - is a simple dash - or the em-dash. You can even add multiple dashes such as -- -- for a longer pause.

"It - is - getting late."

Ellipsis ... can sometimes also work to add a pause between words but usually also adds some “hesitation” or “nervousness” to the voice that might not always fit.

"I... yeah, I guess so..."

Update: It seems like the model will still drop numbers even if you put … after each line item, which still causes two line items to be read as a single item.

After some experimentation, it seems like this is the best workaround that I could come up with while still having the audio sound fairly natural.

Sure! Here are 5 fruits:

One: Apple…
Two: Banana…
Three: Orange…
Four: Strawberry…
Five: Grape…

Using words instead of digits seems to increase the chance that the item numbers will be read aloud by the model, but it does not eliminate the problem entirely.
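For anyone who wants to apply the same rewrite automatically, here is a minimal sketch of the conversion. The names `spell_out_list`, `NUMBER_WORDS` and `LIST_ITEM` are my own, it only covers items numbered 1 to 10, and it assumes each item sits on its own line in the form "1. Apple" or "1) Apple". Anything it does not recognise is left unchanged.

```python
import re

# Hypothetical lookup table; extend it if your lists go past ten items.
NUMBER_WORDS = [
    "One", "Two", "Three", "Four", "Five",
    "Six", "Seven", "Eight", "Nine", "Ten",
]

# A numbered list item such as "  1. Apple" or "2) Banana".
LIST_ITEM = re.compile(r"^\s*(\d+)[.)]\s+(.*\S)\s*$")


def spell_out_list(text: str, pause: str = "...") -> str:
    """Rewrite '1. Apple' as 'One: Apple...' for each numbered list item."""
    out = []
    for line in text.splitlines():
        match = LIST_ITEM.match(line)
        if match:
            number = int(match.group(1))
            # Only rewrite numbers we have a word for; leave the rest as-is.
            if 1 <= number <= len(NUMBER_WORDS):
                line = f"{NUMBER_WORDS[number - 1]}: {match.group(2)}{pause}"
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "Sure! Here are 5 fruits:\n\n  1. Apple\n  2. Banana\n  3. Orange"
    print(spell_out_list(sample))
```

Pass `pause="…"` if you prefer the single-character ellipsis used in the example above. As with the manual version, this only improves the odds that the numbers are spoken.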

