How to decrease the latency of Text-To-Speech API?

Hello all,

For context, I am using the OpenAI API to perform Text-To-Speech, and whenever I pass in a large amount of text the latency can reach almost a minute. I was wondering if there is a way to decrease the latency?

Here's what I've found so far: the FAQ (https://help.openai.com/en/articles/8555505-tts-api) mentions that we can stream the audio in chunks by setting stream=True.

I’ve tried this by using the code below:

response = self.client.audio.speech.create(
    model="tts-1",
    voice="shimmer",
    input=msg,
    stream=True,
)

But I’m getting this error:
TypeError: Speech.create() got an unexpected keyword argument 'stream'

So I'm a bit confused why it says that "stream" doesn't exist when their site says that it does?

Any help is greatly appreciated :pray:

Hi!

Here is the relevant info from the documentation:
https://platform.openai.com/docs/guides/text-to-speech/streaming-real-time-audio

In general it is preferable to refer to the docs on the platform rather than the help files, as the documentation is updated more frequently.
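
The pattern in that guide uses a with_streaming_response helper rather than a stream keyword argument. From memory it looks roughly like this (check the guide for the exact current form):

from openai import OpenAI

client = OpenAI()

# streaming is requested via the helper, not a stream=True parameter
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="shimmer",
    input="Hello world!",
) as response:
    response.stream_to_file("speech.mp3")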

Hello, and thanks for responding!

I tried the method from the link you provided, and it doesn't actually seem like it's streaming the audio. When I call stream_to_file(), the file gets generated, but I can't play it until the whole text has been processed.

You need to handle the audio data in chunks as they arrive rather than waiting for the entire file to be completed. In Python there are various libraries for this, such as PyAudio, which can play audio directly from byte streams.
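
For example, here is a minimal sketch with PyAudio and the pcm response format (raw 16-bit, 24 kHz mono, so there is no container to parse) that starts making sound as soon as the first chunks arrive:

import pyaudio
from openai import OpenAI

client = OpenAI()
pya = pyaudio.PyAudio()
# pcm responses are raw 16-bit signed samples at 24 kHz, mono
stream = pya.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="shimmer",
    input="Hello! Streaming audio chunk by chunk.",
    response_format="pcm",
) as response:
    for chunk in response.iter_bytes(1024):
        stream.write(chunk)  # blocking write: playback starts with the first chunk

stream.stop_stream()
stream.close()
pya.terminate()

Because stream.write() blocks until each chunk has been played, the loop naturally paces itself against the download.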

The comment on the GitHub issue page mentions that the “stream_to_speakers” method became available with PyAudio on March 3rd.

Gotcha, I’ll try it out. Thanks!

Here's receiving a stream of chunks into a buffer (and only playing the WAV afterwards).

import io
import pyaudio
from openai import OpenAI


def byteplay(bytestream):
    """Play 16-bit mono 24 kHz audio bytes on the default output device."""
    pya = pyaudio.PyAudio()
    stream = pya.open(format=pya.get_format_from_width(width=2),
                      channels=1, rate=24000, output=True)
    stream.write(bytestream)
    stream.stop_stream()
    stream.close()
    pya.terminate()  # release PortAudio resources

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="hello there, I'm making a WAV file today",
    response_format='wav'
) as response:
    # Initialize an empty bytes buffer
    buffer = io.BytesIO()
    
    # Read audio data from the generator
    for chunk in response.iter_bytes():
        print(len(chunk))
        buffer.write(chunk)

    # Go back to the start of the buffer
    buffer.seek(0)

    # Play the audio from the buffer
    byteplay(buffer.getvalue())
    #print(len(buffer.getvalue()))

So you can pull out your HTTP chunks and play the audio back from them.

Can WAV be played instantly? Yes, if you want immediate buffer underruns and noise. With a 100 kB preload in a threaded playback buffer, I only got a sentence to play before a buffer underrun on my WiFi PC: the API doesn't send uncompressed audio fast enough, even at 24 kHz mono. You also can't know the final length from the stream, so you can't tell how much to buffer even if you measure the stream rate.

Buffered WAV streaming with PyAudio:
import pyaudio
import queue
import threading
from openai import OpenAI

def play_audio_data(pya, audio_queue):
    """ Plays audio chunks from the queue. """
    stream = pya.open(format=pya.get_format_from_width(width=2), channels=1, rate=24000, output=True)
    while True:
        chunk = audio_queue.get()
        if chunk is None:  # Sentinel value to stop the playback
            break
        stream.write(chunk)
    stream.stop_stream()
    stream.close()

def stream_audio(model: str, voice: str, input_text: str, initial_buffer_size: int = 150000):
    pya = pyaudio.PyAudio()
    audio_queue = queue.Queue()

    client = OpenAI()

    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=input_text,
        response_format='wav'
    ) as response:
        buffer = b''  # Temporary buffer to accumulate initial chunks
        playback_started = False
        play_thread = None

        # Process each chunk only once
        for chunk in response.iter_bytes():
            if not playback_started:
                buffer += chunk
                # Check if initial buffer is sufficiently filled
                if len(buffer) >= initial_buffer_size:
                    # Start the playback thread once the buffer size is reached
                    audio_queue.put(buffer)  # Send the initial buffer to the queue
                    play_thread = threading.Thread(target=play_audio_data, args=(pya, audio_queue))
                    play_thread.start()
                    playback_started = True
                    buffer = b''  # Clear the initial buffer since it's now in the queue
            else:
                audio_queue.put(chunk)

        if not playback_started:
            # If the stream ends before filling the initial buffer, start playback with whatever we have
            audio_queue.put(buffer)
            play_thread = threading.Thread(target=play_audio_data, args=(pya, audio_queue))
            play_thread.start()

        # End signal for the playback thread
        audio_queue.put(None)
        if play_thread:
            play_thread.join()

        # Cleanup
        pya.terminate()

# Usage
stream_audio(model="tts-1", voice="alloy", input_text="hello there, I'm making a wav file today")

Thus we must go compressed (and I choose open source) to make this thing talk some paragraphs before a chat completions response is even done.

Unlike AAC, which arrives as a raw stream that you have to mux into an MP4 yourself if you want a normal file, when you specify opus the Opus audio arrives already wrapped in an Ogg container stream.

The Ogg audio is sent in small HTTP chunks by OpenAI, but the actual Ogg pages are large. The first two pages carry no audio, just a lot of null tag space. The internal 20 ms latency of Opus as a codec can't be reached. One must do a couple of rounds of buffering, reassembling the Ogg pages, then decoding and playing them gaplessly. That puts at least a sentence of delay before you could start playing without buffer underruns in a Python app. I'll make it work though…
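
In the meantime, a cheap way to sidestep the Ogg reassembly is to hand the raw chunks to a decoder that already does all of that. A sketch, assuming ffplay from FFmpeg is installed and on the PATH:

import subprocess
from openai import OpenAI

client = OpenAI()

# ffplay reads the Ogg/Opus stream from stdin ("-") and handles page
# reassembly, decoding, and its own jitter buffering
player = subprocess.Popen(
    ["ffplay", "-autoexit", "-nodisp", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE,
)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Here's a paragraph of speech, decoded while it downloads.",
    response_format="opus",
) as response:
    for chunk in response.iter_bytes():
        player.stdin.write(chunk)

player.stdin.close()
player.wait()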

Just by resending the Ogg packets to a browser, you can let that robust WebRTC client figure out the playback.
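
As a sketch of that idea (assuming FastAPI here; any framework that can send a chunked response works), a pass-through endpoint could look like:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/tts")
def tts(text: str):
    def ogg_chunks():
        # forward the Ogg/Opus chunks to the browser as they arrive
        with client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=text,
            response_format="opus",
        ) as response:
            yield from response.iter_bytes()

    return StreamingResponse(ogg_chunks(), media_type="audio/ogg")

An audio element pointed at /tts?text=... should then begin playback while the stream is still arriving.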
