ChatCompletion stream to TTS

Hi,

I am implementing a voice AI where I want to generate speech (via the OpenAI TTS API) from an OpenAI chat completion.
I realized that waiting for the chat completion to finish generating the full response to a user prompt takes too much time, so I decided to try the “stream” feature.

collected_chunks = []
collected_messages = []
async for chunk in chat_response:
    collected_chunks.append(chunk)
    chunk_message = chunk.choices[0].delta.content  # extract the message delta
    collected_messages.append(chunk_message)  # save the message

collected_messages = [m for m in collected_messages if m is not None]
full_reply_content = ''.join(collected_messages)

My question is: how can I use TTS to convert these chunks of messages to speech?
Is it ideal to pass every chunk to TTS, and can TTS handle these chunks?
I believe that if I pass every chunk to TTS, I will have to keep calling the “client.audio.speech.create” API to do so.

Can anyone provide me with a better design for this?


Welcome to the developer forum @mariaclara.agent.ai

This is not the right approach to consume the TTS API, as it will quickly eat away at your requests per minute (RPM) limits.

Additionally, OpenAI’s TTS, unlike some other TTS systems, processes text differently based on the context present within the string. Sending text token by token would produce undesired output.
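To illustrate the point about context: rather than sending individual tokens, you can buffer the stream into whole phrases and only hand complete sentences to TTS. Here is a minimal sketch; the boundary detection on `.`, `!`, `?` is a deliberate simplification (abbreviations, decimals, etc. would need something smarter), and `split_into_phrases` is a hypothetical helper name, not part of any SDK.

```python
def split_into_phrases(tokens):
    """Buffer streamed tokens and yield whole sentences for TTS.

    Sentence-boundary detection on ., !, ? is a simplification; real
    text (abbreviations, decimals) needs something smarter.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        while any(ch in buffer for ch in ".!?"):
            # Emit up to and including the first sentence terminator.
            cut = min(i for i, ch in enumerate(buffer) if ch in ".!?") + 1
            phrase, buffer = buffer[:cut].strip(), buffer[cut:]
            if phrase:
                yield phrase
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

With this, each `yield` would become one `client.audio.speech.create` call on a complete sentence, rather than one call per token:

```python
phrases = split_into_phrases(["Hello", " world", ". How", " are you?"])
print(list(phrases))  # ['Hello world.', 'How are you?']
```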

Here’s a working implementation using threading. It links the whole chain together (you provide a prompt, and you start hearing the response while everything is still streaming), so you can stream the audio response to a prompt. One thread streams the text reply and cuts it into phrases, which are enqueued for TTS; a second thread runs TTS on each phrase as it completes; and a third thread plays each phrase aloud as soon as it has been synthesized.

The final effect is much like the ChatGPT app, where you get a streaming audio response to your question and don’t have to wait for the full text to come back before you can start listening. What’s here could surely be improved; it’s primarily designed to show, in a terminal, how it all fits together.
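The three-stage handoff above can be sketched with standard-library queues. This is not the gist itself: `synthesize` and `play` are hypothetical stand-ins for the OpenAI TTS call and the audio player, so the queue wiring between threads is the focus.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of each stage

def text_stage(phrases, tts_q):
    # Stage 1: stream the text reply as phrases and enqueue them for TTS.
    for phrase in phrases:
        tts_q.put(phrase)
    tts_q.put(DONE)

def tts_stage(tts_q, audio_q, synthesize):
    # Stage 2: synthesize each phrase as soon as it arrives.
    while (phrase := tts_q.get()) is not DONE:
        audio_q.put(synthesize(phrase))
    audio_q.put(DONE)

def play_stage(audio_q, play):
    # Stage 3: play clips in order while later ones are still being made.
    while (clip := audio_q.get()) is not DONE:
        play(clip)

def run_pipeline(phrases, synthesize, play):
    tts_q, audio_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=text_stage, args=(phrases, tts_q)),
        threading.Thread(target=tts_stage, args=(tts_q, audio_q, synthesize)),
        threading.Thread(target=play_stage, args=(audio_q, play)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each stage blocks on its input queue, playback of phrase 1 can start while phrase 2 is still being synthesized and phrase 3 is still streaming in, and the `DONE` sentinel shuts the stages down in order.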

I’m not sure why, but the website says I’m not allowed to put links in my post, so you’ll have to assemble the following to see it.

gist[dot]github[dot]com/Ga68/3862688ab55b9d9b41256572b1fedc67