Streaming from the Text-to-Speech API

For anyone looking to stream audio using the speech API:

import pyaudio
from openai import OpenAI

client = OpenAI()

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,  # 16-bit PCM (the constant's value is 8)
                channels=1,
                rate=24_000,
                output=True)

with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input="""I see skies of blue and clouds of white
             The bright blessed days, the dark sacred nights
             And I think to myself
             What a wonderful world""",
        response_format="pcm"
) as response:
    for chunk in response.iter_bytes(1024):
        stream.write(chunk)

stream.stop_stream()
stream.close()
p.terminate()


That example shows streaming the audio output, but to achieve low latency you also have to consider streaming the input. In this example the input is already fully generated and is provided as a single string, not as a stream of words…
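As a rough, untested sketch of what input streaming could look like (the model name and prompt are just placeholders), one option is to buffer the token stream from a streaming chat completion and flush it to TTS one sentence at a time:

import pyaudio
from openai import OpenAI

client = OpenAI()

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24_000, output=True)

def speak(text: str):
    # Stream one chunk of text straight through TTS to the sound card.
    with client.audio.speech.with_streaming_response.create(
            model="tts-1", voice="alloy", input=text, response_format="pcm"
    ) as response:
        for chunk in response.iter_bytes(1024):
            stream.write(chunk)

buffer = ""
completion = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any chat model that supports streaming
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for event in completion:
    buffer += event.choices[0].delta.content or ""
    # Flush to TTS whenever the buffer ends with sentence punctuation.
    if buffer.rstrip().endswith((".", "!", "?")):
        speak(buffer)
        buffer = ""
if buffer.strip():
    speak(buffer)  # say whatever is left over

Each flush opens a new TTS request, so there is still a per-sentence round trip, but speech can start after the first sentence instead of after the whole completion.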

Haven’t tested, but looks promising! Wonder whether with_streaming_response.create has been available since this thread started?

True, but that wasn’t part of my original question.

Talking about realtime etc., join the crowd at GitHub - pipecat-ai/pipecat: Open Source framework for voice and multimodal conversational AI
I think it’s a brilliant framework that deals with the intricate details of transporting audio (and video) data across the interwebs as fast as possible. It’s in very early stages and contributors are welcome!

It has been documented at least since March 3rd, see this post.

Pipecat looks cool btw!


If your input is a string of words, then it might make sense to split by sentence. That’ll generate more coherent speech than sending individual words/chunks to TTS. A sketch of a splitter follows below.
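For instance, a deliberately naive splitter (a sketch only; real text needs smarter handling of abbreviations, decimals, quotes, etc.) might look like:

import re

def split_sentences(text: str) -> list[str]:
    # Split after sentence-ending punctuation followed by whitespace.
    # Naive on purpose: "Dr. Smith" or "3.14" will be split wrongly.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

for sentence in split_sentences("Hello there! How are you? I am fine."):
    print(sentence)  # in practice, send each sentence to the TTS endpoint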

One of the examples in the docs shows streaming for the TTS endpoint like this:

from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
)

response.stream_to_file("output.mp3")

So in the end it still streams to a file, but I’d like to access the data chunks as they come in, in code, to send them over a websocket.
How should the response from the API be handled in Node.js to achieve this?
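For reference, in Python the with_streaming_response variant shown earlier gives exactly this chunk-level access; here is an untested sketch relaying each chunk over a websocket (it assumes the third-party websockets package and a socket server already listening at the given URL). In Node, client.audio.speech.create() returns a fetch-style Response, so response.body should give you a readable stream to pipe into a socket the same way, though I haven't verified that here.

import asyncio

import websockets  # third-party package, assumed installed
from openai import OpenAI

client = OpenAI()

async def stream_tts_to_socket(text: str, url: str = "ws://localhost:8765"):
    async with websockets.connect(url) as ws:
        with client.audio.speech.with_streaming_response.create(
                model="tts-1",
                voice="alloy",
                input=text,
                response_format="pcm",
        ) as response:
            # Note: this blocking iteration stalls the event loop between
            # sends; fine for a demo, use a thread/executor in production.
            for chunk in response.iter_bytes(4096):
                await ws.send(chunk)  # one binary frame per audio chunk

asyncio.run(stream_tts_to_socket("Hello world! This is a streaming test."))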

Here is a complete script for anyone who needs a standalone test. You can adjust the buffer size on it, etc.

import os
import requests
import openai
from pydub import AudioSegment
from pydub.playback import play
import io
from dotenv import load_dotenv
from queue import Queue
import threading

# Load environment variables from .env file

load_dotenv()

# Set OpenAI API key

openai.api_key = os.getenv('OPENAI_API_KEY')

def stream_audio_from_openai(api_url, params, headers, audio_queue):
    response = requests.post(api_url, json=params, headers=headers, stream=True)
    response.raise_for_status()

    try:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                audio_queue.put(chunk)
    except Exception as e:
        print(f"Failed to stream audio: {e}")
    finally:
        audio_queue.put(None)  # Sentinel to indicate the end of the stream

def play_audio_from_queue(audio_queue, initial_buffer_size=32768):
    audio_buffer = io.BytesIO()
    buffer_size = 0
    playback_started = False

    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        if chunk:
            audio_buffer.write(chunk)
            buffer_size += len(chunk)

        if buffer_size >= initial_buffer_size and not playback_started:
            playback_started = True
            audio_buffer.seek(0)
            try:
                audio_segment = AudioSegment.from_file(audio_buffer, format="mp3")
                play(audio_segment)
                audio_buffer.seek(0)
                audio_buffer.truncate(0)
                buffer_size = 0
            except Exception as e:
                print(f"Error processing initial buffer: {e}")
                audio_buffer.seek(0)
                audio_buffer.truncate(0)
                buffer_size = 0

    # Play any remaining buffered audio
    if buffer_size > 0:
        audio_buffer.seek(0)
        try:
            audio_segment = AudioSegment.from_file(audio_buffer, format="mp3")
            play(audio_segment)
        except Exception as e:
            print(f"Error processing remaining buffer: {e}")

def generate_and_play_speech(text):
    api_url = "https://api.openai.com/v1/audio/speech"
    params = {
        "model": "tts-1",
        "voice": "alloy",
        "input": text
    }
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json"
    }

    audio_queue = Queue()
    stream_thread = threading.Thread(target=stream_audio_from_openai, args=(api_url, params, headers, audio_queue))
    play_thread = threading.Thread(target=play_audio_from_queue, args=(audio_queue, 32768))  # Adjust buffer size here

    stream_thread.start()
    play_thread.start()

    stream_thread.join()
    play_thread.join()  # the streamer's sentinel lets this thread exit

# Example usage
generate_and_play_speech("Hello world! This is a streaming test.")


On a Raspberry Pi 4 using the latest Raspbian: why would the following code not say each number properly?

import pyaudio
from openai import OpenAI

client = OpenAI()

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,  # 16-bit PCM (the constant's value is 8)
                channels=1,
                rate=24_000,
                output=True)

with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input="""Uno, dos, tres, cuatro, cinco, seis, siete, ocho, nueve, diez.""",
        response_format="pcm"
) as response:
    for chunk in response.iter_bytes(1024):
        stream.write(chunk)

I used this for my project and it works perfectly! It speaks seamlessly. Thank you!

Streaming audio was a nightmare for me to implement.

I wrote about it in this article: https://medium.com/@aleksmilanov/the-ai-awakening-ab87546abd06

Happy to provide more detail / code.

If you are not running on a smart TV then it will be a lot easier to implement!


Does the .NET SDK have streaming capability for the AudioClient?

So I use Express for my backend; how would one implement this with WebSockets?