TTS API service usability

I would like to give some feedback on the brand new TTS models and features. I'm already using other TTS services for my app (real-time NPC conversations in VR), and I'd like to point out what I would need in order to integrate and use OpenAI's TTS service. In order of relevance:

  • Response times: with competitors like Google Cloud TTS I'm generally getting response times around 0.5 seconds, while with OpenAI TTS there is no way to get under 3.5 or 4 seconds. That is far too slow for my real-time conversation use case, and the main reason it isn't an option for me.
  • Voices in English sound very natural and believable, but in other languages (at least German and Spanish) they sound like a foreigner with a good command of the language. I can't use it like this.
  • The app I'm working on is global, so I would need to differentiate between locales (British vs. American English, and Spanish from Spain vs. Latin American Spanish).
  • Any speed rate other than 1 produces a distortion in the output that makes the parameter useless.
  • Other TTS systems provide a pitch parameter, which in my case is very useful for simulating additional voices from the same model (together with the speech rate).
  • It would be amazing to have an audio format that doesn't require much processing: WAV/PCM or even ADPCM would be great, and it might also shave some time off compressing on the server and decompressing on the client.

Yeah, same observation. We need more control (pitch and rate) for each voice; it can become too monotonic without it. And a language parameter.


You can get faster TTS response times by breaking up your text into lines and rendering each line individually.
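A minimal sketch of that approach, assuming the `/v1/audio/speech` endpoint with the `tts-1` model and an `OPENAI_API_KEY` environment variable — split on sentence boundaries and request each line separately, so playback can start as soon as the first clip arrives:

```javascript
// Split text into sentences so each can be rendered as its own TTS request.
function splitSentences(text) {
  // Naive split on ., !, ? followed by whitespace; fine for conversational prose.
  return text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0)
}

// Request one clip per sentence; earlier clips can play while later ones render.
async function ttsPerLine(text) {
  const clips = []
  for (const sentence of splitSentences(text)) {
    const resp = await fetch("https://api.openai.com/v1/audio/speech", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "tts-1", voice: "alloy", input: sentence }),
    })
    clips.push(await resp.arrayBuffer())
  }
  return clips
}
```

The trade-off discussed below still applies: per-sentence clips reduce time-to-first-audio, not total rendering time, and the joins can sound abrupt.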

Unfortunately that won't help much when the performance of the API itself is degraded.

I agree with the point about languages other than English, most of them sound like they are being read by someone with an American accent and not a native speaker.

100%. I host my own TTS interface and it’s almost necessary to run it line-by-line to achieve some sort of variety in emotions as well.

Mainly because the AI infers it from the line, but also because it’s proven that writing lines in a narrative style returns better results.

“That is incredible!” he said with skeptical excitement.

Honestly makes a huge difference. Especially compared to sending a paragraph and wondering why it sounds monotone

There’s a lot left to be desired with the OpenAI TTS models. Let me just say that there are other models that are doing fucking amazing work with multi-language :drooling_face:

Shameless plug


Figure the speech generation runs at about 5x or 6x realtime: 18 seconds of audio in 4.3 seconds total.

I just ran off a batch of tts-1 sentences earlier today with the completion done in an average of two seconds.

1.63s, 1.92s, 1.71s, 1.86s, 1.53s, 2.38s, 1.72s, 2.19s, 2.38s, 2.43s, 1.92s, 2.10s, 2.46s, 1.81s, 2.27s, 2.24s, 2.37s, 2.06s, 1.83s, 2.33s, 2.20s, 1.82s, 2.64s, 2.23s, 2.21s, 2.39s, 2.23s, 2.53s, 1.98s, 2.00s, 1.87s, 2.07s.

Individual sentences are not ideal; I used different voices, which necessitated the turns. Rejoined audio doesn't have as natural a cadence as full text.

The AI can take a little bit of “stage directions” in square brackets before it starts to speak them aloud. You can even put “…” or [pause] within the speech.
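A hedged example of what such an input might look like (the specific direction text is illustrative, not anything the API documents):

```javascript
// Illustrative TTS input: a bracketed stage direction up front,
// plus "..." and [pause] inside the speech to shape delivery.
const input =
  "[slowly, with wonder] That is... incredible. [pause] " +
  "I never thought it would actually work."
```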

I think I’ll leave you to your own world of conclusions.

Would anyone else like to chime in?

I’ll leave examples for anyone to chime in.

Separate calls per sentence:

Single API call, one paragraph:

API call transcript

{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': "OpenAI's Whisper AI is a groundbreaking tool that transforms spoken language into written text with impressive accuracy."}
radio_tts-4_alloy__tts-1_alloy_20231215_164139.mp3 took 2.31 seconds
120 characters, cost 0.18 cents.
{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': 'Its ability to recognize and transcribe voice audio in multiple languages makes it an invaluable resource for global communication.'}
radio_tts-4_alloy__tts-1_alloy_20231215_164142.mp3 took 2.59 seconds
131 characters, cost 0.1965 cents.
{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': 'Whisper AI can facilitate accessibility, providing written transcripts for those who are deaf or hard of hearing.'}
radio_tts-4_alloy__tts-1_alloy_20231215_164145.mp3 took 1.99 seconds
113 characters, cost 0.1695 cents.
{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': 'By converting lectures or meetings into text, it enhances productivity and ensures no critical information is missed.'}
radio_tts-4_alloy__tts-1_alloy_20231215_164147.mp3 took 2.35 seconds
117 characters, cost 0.1755 cents.
{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': 'Moreover, its integration into various applications promises a future where voice-driven data entry is both seamless and efficient.'}
radio_tts-4_alloy__tts-1_alloy_20231215_164150.mp3 took 2.49 seconds
131 characters, cost 0.1965 cents.

I thought they were both good, but could definitely hear the “smash” that occurred when you had each sentence broken out.

Couldn’t this be cured by putting ‘…’ or [pause] in there?

Otherwise, not opposed to sticking in a blank mp3 to force the precise amount of silence.
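If you're already in a browser, you don't even need a blank mp3 — a sketch of the same idea with Web Audio, where a zero-filled buffer of the exact length gives precise silence between clips (assumes an existing `AudioContext`):

```javascript
// Convert a gap duration in milliseconds to a frame count at a sample rate.
function silentFrames(ms, sampleRate) {
  return Math.round((ms / 1000) * sampleRate)
}

// A zero-filled AudioBuffer is silence; schedule it between clips
// instead of concatenating a blank mp3. Length must be at least 1 frame.
function makeSilence(ctx, ms) {
  const frames = Math.max(1, silentFrames(ms, ctx.sampleRate))
  return ctx.createBuffer(1, frames, ctx.sampleRate)
}
```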

But big paragraph is obviously good too.

I guess if we ignore


This would be worth discussing.



What would make the models more competitive?

I don’t use TTS, so what is an example of one rendered in narrative style vs. not?

I am interested in TTS though … thinking of creating some sort of “reminder system” that uses TTS, with underlying text rendered with an LLM.

Well, first let me tell you that I have just the system for you. For me, it’s been nice having a personal assistant with a voice similar to the Animal Planet narrator :shopping:

There’s a lot of parameters that typical TTS engines offer. Check it out:

(Not mentioned above, but they also offer SSML — so <break> instead of ..., or <phoneme alphabet="ipa" ph="əˈbuːt">about</phoneme> to control pronunciation.)
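For reference, an SSML fragment using those two tags might look like this (the surrounding `<speak>` wrapper and the exact sentence are assumptions; tag and attribute names follow the SSML spec):

```xml
<speak>
  This model is <break time="500ms"/> surprisingly good,
  <phoneme alphabet="ipa" ph="əˈbuːt">about</phoneme> which I was skeptical.
</speak>
```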

I offer a configuration option for each sentence to match the desired mood. Shit is expensive so it doesn’t make sense to generate a paragraph and cross fingers.

I think the most impressive feature that they offer is voice cloning. Less than 10 minutes of clear audio is enough to replicate.


OK, so besides cool waveform generation … what about the hardware to play the voices?

So I was thinking of having some speaker sitting in the corner of the house, with a webhook configured, and I send the waveform data to the webhook and it plays the waveform.

So what hardware does this? Just interested mainly in a standalone speaker for fun, but open to a computer as worst case backup.

I will get back to you in business days. :person_in_tuxedo:


I can dictate speech to text, send the text file to my printer’s server by ftp, and get a printout by push. (not today’s cloud garbage printers).

Pushing an mp3 onto a wi-fi attached server and having it auto-played seems like something you’d have to start at the Raspberry Pi level, with a DAC audio shield. Even an MP3-accelerated audio board with an amp is elusive, let alone a very specific smart-speaker application.


I was hoping I could hack the Google Assistant speakers lying all over my house.

This is probably the solution … but I can’t be the only one wanting this?

An LLM powered speaker (via webhook)? And now with cloned voices? This will dominate. :rofl:

Yes, you are the only one in the world wanting this. :grinning: Everyone else would just bluetooth their ChatGPT app.

I’m reminded of tunneling over the company’s satellite connection to a remote store’s kiosk terminal, dumping an HP driver’s intercepted print-to-file output into its print spooler directory (set to another network printer), and printing an award certificate for an employee on their invoice printer. The department manager thought I was some superhacker.

So in conclusion, TTS does not have latency as high as the OP reports, but it does take a reasonable amount of time to generate.


I was hoping to create a fun project in time for the holidays using TTS. I thought of making a TTS-1 Voice Choir by coaxing them to sing Silent Night, Holy Night a cappella. But perhaps not yet.

Here’s my “lyric sheet” lol

const message = `Silent... ni-ght, ho-ly ni-ght...\n` +
            `Aaall... is... calm..., All... is... bright...\n` +
            `Round... yon vir-gin, moth-er and child...\n` +
            `Ho-ly in-fant so... ten-der and... mild...\n` +
            `Sleep... in heav-en-ly... peace...\n` +
            `Sleep... in heav-en-ly peace...`

Here’s the audio outputs (via SoundCloud)

I have no mixer, but I tried playing them all at the same time using JS but it is not good, lol.

audioContext.value = new AudioContext()

const files = [
    // ...the rendered voice files, omitted here...
]

const audioBuffers = await Promise.all(
    Array.from(files).map(async (file) => {
        const resp = await fetch(`/audio/${file}`)
        const buffer = await resp.arrayBuffer()
        return await audioContext.value.decodeAudioData(buffer)
    })
)

let sources = []

audioBuffers.forEach((buffer, i) => {
    sources[i] = audioContext.value.createBufferSource()
    sources[i].buffer = buffer
    // connect each source to the output, otherwise nothing is audible
    sources[i].connect(audioContext.value.destination)
})

sources.forEach((source) => source.start())

Well, maybe next year…


But you only need to do this for the first sentence which is typically a title or short introduction like “Certainly let me do this for you!” from an LLM.

This is often (but not always) a discrete line of text that can be read aloud on its own.

Then, while the first audio chunk is playing, you can render the rest of the text at the line-break level, which is typically a paragraph. This is the natural break in text/speech that lets the reader/narrator switch context, emotion, etc.

You may still need to break up larger paragraphs to keep a constant flow of speech but it’s the middle ground between sounding more natural and being responsive.
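A rough sketch of that splitting strategy (the regex and function name are mine, not from any API): take the first sentence for immediate rendering, then hand back the remaining paragraphs to render while the intro plays.

```javascript
// Split text into a fast-to-render first sentence plus remaining paragraphs.
function splitForStreaming(text) {
  // First sentence: everything up to the first ., ! or ?
  const m = text.match(/^\s*[^.!?]*[.!?]/)
  if (!m) return { first: text.trim(), paragraphs: [] }
  const first = m[0].trim()
  // Remaining text, broken at blank lines (paragraph boundaries).
  const rest = text.slice(m[0].length).trim()
  const paragraphs = rest ? rest.split(/\n{2,}/).map((p) => p.trim()) : []
  return { first, paragraphs }
}
```

Render `first` immediately for low time-to-first-audio, then queue each entry of `paragraphs` as its own TTS request while earlier audio plays.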

The only thing that really concerns me is the cost. OpenAI’s TTS seems a lot cheaper on paper than offerings from other vendors, but when I use it to read all output from GPT-4 it still seems to be about 5-10x more expensive than the GPT-3/4 completions themselves. So for now it’s not economically viable to use TTS for fully voice-interactive real-time scenarios.

(Example: relative costs with TTS-1 turned on vs. GPT-3/4 using only Whisper.)
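For context, tts-1 is billed per input character; the per-clip figures quoted earlier in the thread (e.g. 120 characters costing 0.18 cents) imply a rate of $0.015 per 1,000 characters, which this small helper reproduces:

```javascript
// Cost in cents for a tts-1 request, given its input character count.
// Rate of $0.015 per 1K characters is inferred from the numbers above.
function ttsCostCents(chars, ratePer1kDollars = 0.015) {
  return (chars / 1000) * ratePer1kDollars * 100
}
```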

I hope the costs of TTS-1 can be reduced to the same level as GPT-4 Turbo.
The price should eventually come down, as I don’t see how TTS would need more compute resources than GPT-4.