TTS API service usability

I would like to give some feedback on the brand new TTS models and features. I'm already using other TTS services for my app (real-time NPC conversations in VR), and I'd like to point out what I would need in order to integrate and use OpenAI's TTS service. In order of relevance:

  • Response times: with competitors like Google Cloud TTS I'm generally getting response times around 0.5 seconds, while with OpenAI TTS there is no way to get under 3.5 or 4 seconds. That is far too slow for my real-time conversation use case, and the main reason it isn't an option for me.
  • Voices in English sound very natural and believable, but in other languages (at least German and Spanish) they sound like a foreigner with a good command of the language. I can't use it like this.
  • The app I'm working on is global, so I would need to differentiate between locales (British vs. American English, and Spanish from Spain vs. Latin American Spanish).
  • Any speed rate other than 1 produces a distortion in the output that makes the parameter useless.
  • Other TTS systems provide a pitch parameter, which in my case is very useful for simulating additional voices from the same model (together with the speech rate).
  • It would be amazing to have an audio format that doesn't require much processing: WAV/PCM or even ADPCM would be great, and it might also shave some time off compressing on the server and decompressing on the client.

Yeah, same observation. We need more control (pitch and rate) for each voice; it can become too monotonic without it. And a language parameter.


You can get faster TTS response times by breaking up your text into lines and rendering each line individually.
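A minimal sketch of that approach, assuming the `/v1/audio/speech` endpoint with the `tts-1` model and an `OPENAI_API_KEY` environment variable — split on sentence boundaries and request each line separately, so playback can start as soon as the first clip arrives:

```javascript
// Split text into sentences so each can be rendered as its own TTS request.
function splitSentences(text) {
  // Naive split on ., !, ? followed by whitespace; fine for conversational prose.
  return text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0)
}

// Request one clip per sentence; earlier clips can play while later ones render.
async function ttsPerLine(text) {
  const clips = []
  for (const sentence of splitSentences(text)) {
    const resp = await fetch("https://api.openai.com/v1/audio/speech", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "tts-1", voice: "alloy", input: sentence }),
    })
    clips.push(await resp.arrayBuffer())
  }
  return clips
}
```

The trade-off discussed below still applies: per-sentence clips reduce time-to-first-audio, not total rendering time, and the joins can sound abrupt.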

Unfortunately that won't help much when the performance of the API itself is degraded.

I agree with the point about languages other than English, most of them sound like they are being read by someone with an American accent and not a native speaker.

100%. I host my own TTS interface and it’s almost necessary to run it line-by-line to achieve some sort of variety in emotions as well.

Mainly because the AI infers it from the line, but also because it’s proven that writing lines in a narrative style returns better results.

“That is incredible!” he said with skeptical excitement.

Honestly makes a huge difference. Especially compared to sending a paragraph and wondering why it sounds monotone

There’s a lot left to be desired with the OpenAI TTS models. Let me just say that there are other models that are doing fucking amazing work with multi-language :drooling_face:

Shameless plug


Figure the speech generation runs at about 5x or 6x realtime: 18 seconds of audio in 4.3 seconds total.

I just ran off a batch of tts-1 sentences earlier today with the completion done in an average of two seconds.

1.63s, 1.92s, 1.71s, 1.86s, 1.53s, 2.38s, 1.72s, 2.19s, 2.38s, 2.43s, 1.92s, 2.10s, 2.46s, 1.81s, 2.27s, 2.24s, 2.37s, 2.06s, 1.83s, 2.33s, 2.20s, 1.82s, 2.64s, 2.23s, 2.21s, 2.39s, 2.23s, 2.53s, 1.98s, 2.00s, 1.87s, 2.07s.

Individual sentences are not ideal; I used different voices, which necessitated the turns. Rejoined audio doesn't have as natural a cadence as full text.

The AI can take a little bit of “stage directions” in square brackets before it starts to speak them aloud. You can even put “…” or [pause] within the speech.
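A hedged example of what such an input might look like (the specific direction text is illustrative, not anything the API documents):

```javascript
// Illustrative TTS input: a bracketed stage direction up front,
// plus "..." and [pause] inside the speech to shape delivery.
const input =
  "[slowly, with wonder] That is... incredible. [pause] " +
  "I never thought it would actually work."
```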

I think I’ll leave you to your own world of conclusions.

Would anyone else like to chime in?

I’ll leave examples for anyone to chime in.

Separate calls per sentence:

Single API call, one paragraph:

API call transcript

{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': "OpenAI's Whisper AI is a groundbreaking tool that transforms spoken language into written text with impressive accuracy."}
radio_tts-4_alloy__tts-1_alloy_20231215_164139.mp3 took 2.31 seconds
120 characters, cost 0.18 cents.
{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': 'Its ability to recognize and transcribe voice audio in multiple languages makes it an invaluable resource for global communication.'}
radio_tts-4_alloy__tts-1_alloy_20231215_164142.mp3 took 2.59 seconds
131 characters, cost 0.1965 cents.
{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': 'Whisper AI can facilitate accessibility, providing written transcripts for those who are deaf or hard of hearing.'}
radio_tts-4_alloy__tts-1_alloy_20231215_164145.mp3 took 1.99 seconds
113 characters, cost 0.1695 cents.
{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': 'By converting lectures or meetings into text, it enhances productivity and ensures no critical information is missed.'}
radio_tts-4_alloy__tts-1_alloy_20231215_164147.mp3 took 2.35 seconds
117 characters, cost 0.1755 cents.
{'voice': 'alloy', 'model': 'tts-1', 'response_format': 'mp3', 'input': 'Moreover, its integration into various applications promises a future where voice-driven data entry is both seamless and efficient.'}
radio_tts-4_alloy__tts-1_alloy_20231215_164150.mp3 took 2.49 seconds
131 characters, cost 0.1965 cents.

I thought they were both good, but could definitely hear the “smash” that occurred when you had each sentence broken out.

Couldn’t this be cured by putting ‘…’ or [pause] in there?

Otherwise, not opposed to sticking in a blank mp3 to force the precise amount of silence.
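If you're already in a browser, you don't even need a blank mp3 — a sketch of the same idea with Web Audio, where a zero-filled buffer of the exact length gives precise silence between clips (assumes an existing `AudioContext`):

```javascript
// Convert a gap duration in milliseconds to a frame count at a sample rate.
function silentFrames(ms, sampleRate) {
  return Math.round((ms / 1000) * sampleRate)
}

// A zero-filled AudioBuffer is silence; schedule it between clips
// instead of concatenating a blank mp3. Length must be at least 1 frame.
function makeSilence(ctx, ms) {
  const frames = Math.max(1, silentFrames(ms, ctx.sampleRate))
  return ctx.createBuffer(1, frames, ctx.sampleRate)
}
```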

But big paragraph is obviously good too.

I guess if we ignore


This would be worth discussing.



What would make the models more competitive?

I don’t use TTS, so what is an example of one rendered in narrative style vs. not?

I am interested in TTS though … thinking of creating some sort of “reminder system” that uses TTS, with underlying text rendered with an LLM.

Well, first let me tell you that I have just the system for you. For me, it’s been nice having a personal assistant with a voice similar to the Animal Planet narrator :shopping:

There’s a lot of parameters that typical TTS engines offer. Check it out:

(Not mentioned above, but they also offer SSML — so <break> instead of ..., or <phoneme alphabet="ipa" ph="əˈbuːt">about</phoneme> to control pronunciation.)
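For reference, an SSML fragment using those two tags might look like this (the surrounding `<speak>` wrapper and the exact sentence are assumptions; tag and attribute names follow the SSML spec):

```xml
<speak>
  This model is <break time="500ms"/> surprisingly good,
  <phoneme alphabet="ipa" ph="əˈbuːt">about</phoneme> which I was skeptical.
</speak>
```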

I offer a configuration option for each sentence to match the desired mood. Shit is expensive so it doesn’t make sense to generate a paragraph and cross fingers.

I think the most impressive feature that they offer is voice cloning. Less than 10 minutes of clear audio is enough to replicate.


OK, so besides cool waveform generation … what about the hardware to play the voices?

So I was thinking of having some speaker sitting in the corner of the house, with a webhook configured, and I send the waveform data to the webhook and it plays the waveform.

So what hardware does this? Just interested mainly in a standalone speaker for fun, but open to a computer as worst case backup.

I will get back to you in business days. :person_in_tuxedo:


I can dictate speech to text, send the text file to my printer’s server by ftp, and get a printout by push. (not today’s cloud garbage printers).

Pushing an mp3 onto a wi-fi attached server and having it auto-played seems like something you’d have to start at the Raspberry Pi level, with a DAC audio shield. Even an MP3-accelerated audio board with an amp is elusive, let alone a very specific smart-speaker application.


I was hoping I could hack the Google Assistant speakers lying all over my house.

This is probably the solution … but I can’t be the only one wanting this?

An LLM powered speaker (via webhook)? And now with cloned voices? This will dominate. :rofl:

Yes, you are the only one in the world wanting this. :grinning: Everyone else would just bluetooth their ChatGPT app.

I’m reminded of tunneling over the company’s satellite connection to a remote store’s kiosk terminal, dumping an HP driver’s intercepted print-to-file output into its print spooler directory (set to another network printer), and printing an award certificate for an employee on their invoice printer. The department manager thought I was some superhacker.

So in conclusion, TTS does not have latency as high as the OP reports, but it does take a reasonable amount of time to generate.


I was hoping to create a fun project in time for the holidays using TTS. I thought of making a TTS-1 Voice Choir by coaxing them to sing Silent Night, Holy Night a cappella. But perhaps not yet.

Here’s my “lyric sheet” lol

const message = `Silent... ni-ght, ho-ly ni-ght...\n` +
            `Aaall... is... calm..., All... is... bright...\n` +
            `Round... yon vir-gin, moth-er and child...\n` +
            `Ho-ly in-fant so... ten-der and... mild...\n` +
            `Sleep... in heav-en-ly... peace...\n` +
            `Sleep... in heav-en-ly peace...`

Here’s the audio outputs (via SoundCloud)

I have no mixer, but I tried playing them all at the same time using JS but it is not good, lol.

audioContext.value = new AudioContext()

const files = [
    // ...the rendered voice files, omitted here...
]

const audioBuffers = await Promise.all(
    Array.from(files).map(async (file) => {
        const resp = await fetch(`/audio/${file}`)
        const buffer = await resp.arrayBuffer()
        return await audioContext.value.decodeAudioData(buffer)
    })
)

let sources = []

audioBuffers.forEach((buffer, i) => {
    sources[i] = audioContext.value.createBufferSource()
    sources[i].buffer = buffer
    // connect each source to the output, otherwise nothing is audible
    sources[i].connect(audioContext.value.destination)
})

sources.forEach((source) => source.start())

Well, maybe next year…


But you only need to do this for the first sentence which is typically a title or short introduction like “Certainly let me do this for you!” from an LLM.

This is often (but not always) a discrete line of text that can be read aloud on its own.

Then, while the first audio chunk is playing, you can render the rest of the text at the line-break level, which is typically a paragraph. This is the natural break in text/speech that lets the reader/narrator switch context, emotion, etc.

You may still need to break up larger paragraphs to keep a constant flow of speech but it’s the middle ground between sounding more natural and being responsive.
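A rough sketch of that splitting strategy (the regex and function name are mine, not from any API): take the first sentence for immediate rendering, then hand back the remaining paragraphs to render while the intro plays.

```javascript
// Split text into a fast-to-render first sentence plus remaining paragraphs.
function splitForStreaming(text) {
  // First sentence: everything up to the first ., ! or ?
  const m = text.match(/^\s*[^.!?]*[.!?]/)
  if (!m) return { first: text.trim(), paragraphs: [] }
  const first = m[0].trim()
  // Remaining text, broken at blank lines (paragraph boundaries).
  const rest = text.slice(m[0].length).trim()
  const paragraphs = rest ? rest.split(/\n{2,}/).map((p) => p.trim()) : []
  return { first, paragraphs }
}
```

Render `first` immediately for low time-to-first-audio, then queue each entry of `paragraphs` as its own TTS request while earlier audio plays.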

The only thing that really concerns me is the cost. OpenAI’s TTS seems a lot cheaper on paper than offerings from other vendors, but when I use it to read all output from GPT-4 it still seems to be about 5-10x more expensive than the GPT-3/4 completions themselves. So for now it’s not economically viable to use TTS for fully voice-interactive real-time scenarios.

(Example: relative costs with TTS-1 turned on vs. GPT-3/4 using only Whisper.)
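For context, tts-1 is billed per input character; the per-clip figures quoted earlier in the thread (e.g. 120 characters costing 0.18 cents) imply a rate of $0.015 per 1,000 characters, which this small helper reproduces:

```javascript
// Cost in cents for a tts-1 request, given its input character count.
// Rate of $0.015 per 1K characters is inferred from the numbers above.
function ttsCostCents(chars, ratePer1kDollars = 0.015) {
  return (chars / 1000) * ratePer1kDollars * 100
}
```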

I hope the costs of TTS-1 can be reduced to the same level as GPT-4 Turbo.
The price should eventually come down, as I don’t see how TTS would need more compute resources than GPT-4.