Streaming text in and audio out?



I’m curious whether it’s possible to stream text from a text model like gpt-3.5 directly into the TTS endpoint and stream the audio response as output.

Even though streaming the audio output is possible, waiting for the entire text to finish before generating the audio stream results in too much latency.


Welcome to the forum Simon!

I haven’t done this myself, but a friend who did told me he used async text generation and sent complete chunks of text (like sentences) to text-to-speech as they arrived, rather than waiting for the whole text to generate.
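To make the sentence-chunking idea concrete, here is a minimal sketch in Python. It is not the friend's code: `iter_sentences` is a hypothetical helper name, and it assumes the text arrives as an iterable of string fragments (for example, the deltas from a streaming chat completion). Each complete sentence is yielded as soon as its end is seen, so it could be handed to TTS right away.

```python
import re
from typing import Iterable, Iterator

# Assumed sentence boundary: '.', '!' or '?' followed by whitespace.
# This is a simplification and will mis-split abbreviations like "e.g. this".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of text fragments."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence;
        # the last part may still be growing, so keep it buffered.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    fake_stream = ["Hello wor", "ld. How are", " you? Fine"]
    for sentence in iter_sentences(fake_stream):
        print(sentence)  # each of these would be sent to TTS
```

A sentence is only released once the whitespace after its punctuation has arrived, which avoids cutting a sentence in half when a fragment happens to end on a period.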


Ah, of course! That’s a good idea, but hopefully they implement text and speech generation in a single endpoint.


Here’s a working implementation using threading. It links together the whole chain: you provide a prompt, and you start hearing the response while everything is still streaming. It uses three threads. The first streams the text reply and splits it into phrases, which are enqueued for TTS. The second runs TTS on each phrase as it completes. The third plays each phrase out loud as soon as it has been TTS’d.
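For readers who want the shape of that pipeline without the linked code, here is a rough sketch of the three-thread layout. It is not the poster's implementation: `run_pipeline`, `synthesize`, and `play` are hypothetical names, and the TTS and playback steps are passed in as plain callables so that no particular API is assumed. Phrases are split on `.`, `!`, and `?`, which is a simplification (it would also split "3.5").

```python
import queue
import threading
from typing import Any, Callable, Iterable

_DONE = object()  # sentinel that marks the end of a queue's stream


def run_pipeline(
    deltas: Iterable[str],
    synthesize: Callable[[str], Any],  # hypothetical: phrase -> audio
    play: Callable[[Any], None],       # hypothetical: audio -> speaker
    phrase_ends: str = ".!?",
) -> None:
    phrase_q: queue.Queue = queue.Queue()
    audio_q: queue.Queue = queue.Queue()

    def text_worker() -> None:
        # Thread 1: accumulate streamed text and enqueue finished phrases.
        buffer = ""
        try:
            for delta in deltas:
                for ch in delta:
                    buffer += ch
                    if ch in phrase_ends:
                        if buffer.strip():
                            phrase_q.put(buffer.strip())
                        buffer = ""
            if buffer.strip():
                phrase_q.put(buffer.strip())
        finally:
            phrase_q.put(_DONE)  # always signal the TTS thread to stop

    def tts_worker() -> None:
        # Thread 2: turn each phrase into audio as soon as it is ready.
        try:
            while (phrase := phrase_q.get()) is not _DONE:
                audio_q.put(synthesize(phrase))
        finally:
            audio_q.put(_DONE)  # always signal the playback thread to stop

    def playback_worker() -> None:
        # Thread 3: play audio clips in the order they were produced.
        while (audio := audio_q.get()) is not _DONE:
            play(audio)

    threads = [
        threading.Thread(target=worker)
        for worker in (text_worker, tts_worker, playback_worker)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    # Stand-ins so the sketch runs in a terminal without any audio stack.
    run_pipeline(
        ["Hi the", "re. How are you", "? Bye"],
        synthesize=lambda phrase: phrase.upper(),
        play=print,
    )
```

Because each stage is a single thread reading from a FIFO queue, phrases are played in the order they were generated, and the first phrase can start playing while later text is still streaming in.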

The final effect is much like working with the ChatGPT app, where you get a “streaming audio response” to your question and don’t have to wait for the full text to come back before you can start listening. I’m sure what’s here could be improved; it’s primarily designed to show it all put together in a terminal.

I’m not sure why, but I’m not allowed to put links in my post, says the website. So you’ll have to assemble the following to see it.