Whisper Streaming Strategy

The Whisper text-to-speech API does not yet support streaming. This would be a great feature.

I’m trying to think of ways I can take advantage of Whisper with my Assistant. A moderate response can take 7-10 seconds to process, which is a bit slow. I’m considering breaking up the assistant’s text by sentences and simply sending over each sentence as it comes in. The downside is that Whisper won’t have any context for the sentences, so it might sound unnatural. Any better ideas out there?

Hi @GoldenJoe

Whisper is a speech-to-text model.

tts-1 and tts-1-hd are the models used to synthesize speech from text.

There’s a difference between how audio is streamed and how text is streamed. The text generation models use Server-Sent Events (SSE) for streaming, while the speech synthesis models use chunked transfer encoding, which means that audio can be played as the chunks arrive.
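For comparison, here is a minimal sketch of the SSE side, using the chat completions endpoint with stream=True (the model name is just a placeholder):

from openai import OpenAI

client = OpenAI()

# Text streams as Server-Sent Events: each event carries a small delta of text
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder, use whichever chat model you like
    messages=[{"role": "user", "content": "Explain chunked transfer encoding in one sentence."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)
print()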

With the TTS models, the first chunks of playable audio arrive after around 3-4 seconds.

In my experience, the TTS models do some sort of context-aware speech synthesis, where the synthesized audio varies noticeably in intonation and emotion as the context changes. Hence, sending the text in smaller chunks means losing out on the context-based intonation.

It will also eat away at your RPM rate-limits.

Here’s how I stream TTS audio:

from openai import OpenAI
import io
import pyaudio
import wave

client = OpenAI()

# Buffer size in bytes for each audio chunk; a small value (e.g., 100) keeps
# latency low, a larger one (e.g., 4096) reduces per-chunk overhead
BUFFER_SIZE = 100

def byte_stream_generator(response):
    """
    Generator function that yields a stream of bytes from the response.

    :param response: The response object from the OpenAI API call.
    """
    try:
        for byte_chunk in response.iter_bytes(chunk_size=BUFFER_SIZE):
            if byte_chunk:  # Only yield non-empty byte chunks
                yield byte_chunk
            else:
                print("Skipped an empty or corrupted packet")
    except Exception as e:
        print(f"Error while streaming bytes: {e}")

with client.audio.speech.with_streaming_response.create(
    model="tts-1-hd",
    voice="nova",
    input='(goofily): We saw you the other day at the beach making a sand castle. (with awe): "IT WAS THE BIGGEST sand castle I ever saw"',
    response_format="wav",
) as response:
    try:
        # Initialize PyAudio
        p = pyaudio.PyAudio()

        # The output stream is opened once the WAV header has been parsed
        stream = None

        # Initialize the WAV header
        wav_header = None

        for audio_chunk in byte_stream_generator(response=response):
            # Check if this is the first chunk (WAV header)
            if wav_header is None:
                wav_header = audio_chunk
                # Extract the WAV format parameters from the header
                wav_format = wave.open(io.BytesIO(wav_header), 'rb')
                channels, samp_width, framerate, nframes, comptype, compname = wav_format.getparams()
                # Open the stream with the parameters read from the header
                stream = p.open(format=p.get_format_from_width(samp_width), channels=channels, rate=framerate, output=True)
            else:
                # Write the audio chunk to the stream
                stream.write(audio_chunk)

        # Close the stream and PyAudio
        if stream is not None:
            stream.stop_stream()
            stream.close()
        p.terminate()
    except Exception as e:
        print(f"Error during playback: {e}")

print("Playback finished.")

Well sure, streaming the response is a no-brainer. If you are sending it to a client, you don’t even need PyAudio; you can just send the raw bytes over a socket connection.

But what I’m talking about is the pipeline of a true “conversational AI”: you talk to it, it talks back. Unfortunately, without being able to provide the TTS service a stream of input, you have to wait for the complete response from your chat/assistant. So your interface will print out the full message and then read it to you. I understand WHY this is happening; I’m explaining that it’s not a good experience and asking if anyone has come up with a workaround I haven’t thought of.

You can do some kind of pre-prediction.

I mean, in a lot of sentences you already know what the person you are talking with
is

going

to

say

pretty

early

in

the sentence

Therefore you may want to try giving the incomplete sentence to a model in the background, have it complete the sentence, and get an answer based on that completion.
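Just to illustrate the idea (this is only a sketch, not a tested setup): send the partial utterance to a fast model to guess how the sentence ends, then draft an answer to the guessed question in the background. The model name, prompts, and helper are all placeholders.

from openai import OpenAI

client = OpenAI()

def speculative_answer(partial_utterance: str) -> str:
    # Step 1: ask a fast model to guess how the unfinished sentence ends
    prediction = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Complete the user's unfinished sentence. Reply with the full sentence only."},
            {"role": "user", "content": partial_utterance},
        ],
    )
    predicted_sentence = prediction.choices[0].message.content

    # Step 2: draft an answer to the predicted question while the user is still speaking
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": predicted_sentence}],
    )
    return draft.choices[0].message.content

# If the finished sentence ends up close to the prediction, the draft can go
# straight to TTS; otherwise throw it away and answer the real question.
print(speculative_answer("What time does the store"))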

You may also cache the beginning of an answer as an MP3 file and ask the assistant/TTS to continue the answer after that beginning - I don’t know if that makes sense to you.

You can also play random scenes, e.g. something drops in the background and the bot says stuff like “oopsie, sorry, I dropped a knife… not what you are thinking now hahaha. a knife I used to eat on my desk… yeah, I know sometimes I eat at my desk”, which gives the backend enough time to generate the answer.

The answer would then start with “Anyways, back to your question…”, which is streamed as a cached MP3, while the assistant is instructed to continue the answer after “Anyways, back to your question”.
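A rough sketch of that cached-opener trick, assuming a pre-rendered MP3 and a system prompt that tells the model to pick up after the canned phrase (file path, model name, and prompt wording are made up):

from openai import OpenAI

client = OpenAI()

OPENER = "Anyways, back to your question..."
CACHED_OPENER_MP3 = "audio/anyways_back_to_your_question.mp3"  # hypothetical file, rendered once with the TTS API

def answer_with_cached_opener(question: str) -> str:
    # Start playing CACHED_OPENER_MP3 on the client here; meanwhile, generate
    # the continuation so it is ready by the time the opener finishes playing.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder chat model
        messages=[
            {"role": "system", "content": f"Begin your answer exactly where this phrase leaves off, without repeating it: '{OPENER}'"},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

print(answer_with_cached_opener("How do I reset my password?"))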

Also, in support hotlines you will find that many questions are asked over and over again.

Caching the answers in general, and even storing embeddings that point to either an MP3 file location or a function call that creates the answer, might work as well.
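Something like this could work for the FAQ caching, assuming a hypothetical dictionary that maps known questions to pre-rendered MP3 files and using the embeddings endpoint for matching (file paths and the similarity threshold are invented):

import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical cache: known question -> path of a pre-rendered MP3 answer
CACHED_ANSWERS = {
    "What are your opening hours?": "audio/opening_hours.mp3",
    "How do I reset my password?": "audio/reset_password.mp3",
}

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Pre-compute embeddings for the cached questions once at startup
cached = [(question, path, embed(question)) for question, path in CACHED_ANSWERS.items()]

def lookup(question: str, threshold: float = 0.85):
    # Return the cached MP3 path if the incoming question is close enough,
    # otherwise None (fall back to the normal LLM + TTS pipeline)
    v = embed(question)
    best_path, best_sim = None, -1.0
    for _, path, e in cached:
        sim = float(np.dot(v, e) / (np.linalg.norm(v) * np.linalg.norm(e)))
        if sim > best_sim:
            best_path, best_sim = path, sim
    return best_path if best_sim >= threshold else None

print(lookup("When do you open?"))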

What I do is use a sentence splitter to only give full sentences to the TTS API. This helps get a quick response while not losing too much of the context in speech synthesis.

I collect characters from the streamed LLM response and run the sentence splitter once every few received characters. Once I get a split that is longer than 1 (so I am sure there is at least one complete sentence), I give the complete part to the TTS system.

The sentence splitter I use is SaT (part of wtpsplit).
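Putting that together, here is a minimal sketch of the loop described above, assuming the SaT usage from the wtpsplit README (the checkpoint name, the 20-character interval, and the speak() helper are placeholders):

from openai import OpenAI
from wtpsplit import SaT

client = OpenAI()
sat = SaT("sat-3l-sm")  # assumption: any SaT checkpoint from the wtpsplit docs

def speak(sentence: str) -> None:
    # Placeholder: hand one complete sentence to the TTS endpoint,
    # e.g. with the streaming playback code shown earlier in the thread
    print(f"[TTS] {sentence}")

buffer = ""
chars_since_split = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder chat model
    messages=[{"role": "user", "content": "Tell me a short story about a lighthouse."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    buffer += delta
    chars_since_split += len(delta)
    if chars_since_split >= 20:  # run the splitter every ~20 received characters
        chars_since_split = 0
        sentences = sat.split(buffer)
        if len(sentences) > 1:  # at least one sentence is complete
            for sentence in sentences[:-1]:
                speak(sentence.strip())
            buffer = sentences[-1]

# Whatever is left at the end is the final (possibly unterminated) sentence
if buffer.strip():
    speak(buffer.strip())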

We developed a streaming version of Whisper on top of the Whisper backbone. Google “whisper streaming GitHub” to find it.

Please feel free to try it out.