I was able to get some preliminary results streaming and playing back audio in real time with pyaudio:
import os
from time import time

import pyaudio
import requests

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": f'Bearer {os.getenv("OPENAI_API_KEY")}',
}
data = {
    "model": "tts-1",
    "input": "This is a test",
    "voice": "shimmer",
    "response_format": "wav",
}

start_time = time()
response = requests.post(url, headers=headers, json=data, stream=True)

if response.status_code == 200:
    print(f"Time to first byte: {int((time() - start_time) * 1000)} ms")
    p = pyaudio.PyAudio()
    # pyaudio.paInt16 (== 8): 16-bit signed PCM, mono, 24 kHz
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
    # Play each chunk as it arrives instead of waiting for the full response
    for chunk in response.iter_content(chunk_size=1024):
        stream.write(chunk)
    print(f"Time to complete: {int((time() - start_time) * 1000)} ms")
    stream.close()
    p.terminate()
HEADPHONE WARNING: this can cause very harsh noise, especially on longer inputs. Keep volume low.
The best part is that this drastically reduces latency to about 200-500 ms (time to first byte). I found latency was lowest with "response_format": "wav", though the trade-off is larger file size. But on any decent connection, the bottleneck will still be generation speed, not network.
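One way to sanity-check that generation speed, not the network, is the bottleneck: compare wall-clock time against the duration of the audio received so far. A minimal sketch, assuming 16-bit mono PCM at 24 kHz (the helper name and defaults are mine, not part of the API):

```python
def pcm_duration_seconds(num_bytes, rate=24000, channels=1, sample_width=2):
    """Duration of raw PCM audio given its size in bytes.

    Defaults assume what tts-1 appears to return for "wav":
    16-bit (2-byte) samples, mono, 24 kHz.
    """
    bytes_per_second = rate * channels * sample_width
    return num_bytes / bytes_per_second

# 48,000 bytes of 16-bit mono PCM at 24 kHz is exactly one second of audio
print(pcm_duration_seconds(48_000))  # 1.0
```

If the request's total wall-clock time is consistently larger than the duration of the audio it produced, the model is the bottleneck and a faster connection won't help.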
I got the values for format, channels, and rate by writing the response to a .wav file and inspecting it with the wave module, as shown in the pyaudio docs:
import io
import os
import wave

import pyaudio
import requests

# url = ...
# headers = ...
# data = ...

response = requests.post(url, headers=headers, json=data, stream=True)

if response.status_code == 200:
    # Buffer the full response, then save it as a .wav file
    buffer = io.BytesIO()
    for chunk in response.iter_content(chunk_size=1024):
        buffer.write(chunk)
    with open("speech.wav", "wb") as f:
        f.write(buffer.getvalue())

    # Read the playback parameters back out of the WAV header
    with wave.open("speech.wav", "rb") as wf:
        p = pyaudio.PyAudio()
        print("format", p.get_format_from_width(wf.getsampwidth()))
        print("channels", wf.getnchannels())
        print("rate", wf.getframerate())
But my guess is that this is wrong, and is the cause of the intermittent noise.
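One possible contributor: when streaming the "wav" format, the first chunk begins with the RIFF/WAV header, and those header bytes get written to the output stream as if they were PCM samples. That would only explain a brief click at the start, not sustained noise, but it's cheap to rule out. A hedged sketch of skipping past the `data` sub-chunk before playback (the helper name is mine, and this assumes a standard RIFF layout rather than parsing the chunks properly):

```python
import io
import wave


def strip_wav_header(first_chunk: bytes) -> bytes:
    """Return only the PCM bytes from a chunk that starts with a RIFF/WAV header.

    Scans for the 'data' sub-chunk marker and skips the 8-byte
    chunk header (4-byte id + 4-byte size) that follows it.
    """
    idx = first_chunk.find(b"data")
    if idx == -1:
        return first_chunk  # no header found; pass through unchanged
    return first_chunk[idx + 8:]


# Build a tiny in-memory WAV file to demonstrate
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(b"\x01\x02\x03\x04")

pcm = strip_wav_header(buf.getvalue())
print(pcm)  # b'\x01\x02\x03\x04'
```

In the streaming loop you would apply this only to the first chunk. If the harsh noise persists after that, the cause is more likely elsewhere, e.g. a mismatch in the playback parameters or buffer underruns during blocking writes.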