Streaming from the Text-to-Speech API

For anyone looking to stream audio using the speech API:

import pyaudio
from openai import OpenAI

client = OpenAI()

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,  # 16-bit PCM (the constant's value is 8)
                channels=1,
                rate=24_000,
                output=True)

with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input="""I see skies of blue and clouds of white
             The bright blessed days, the dark sacred nights
             And I think to myself
             What a wonderful world""",
        response_format="pcm"
) as response:
    for chunk in response.iter_bytes(1024):
        stream.write(chunk)

stream.stop_stream()
stream.close()
p.terminate()


That example shows streaming the audio output, but to achieve low latency you also have to consider streaming the input. In this example the input is already fully generated and is provided as a single string, not as a stream of words…
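As a rough, untested sketch of what input streaming could look like (the model name and prompt are just placeholders), one option is to buffer the token stream from a streaming chat completion and flush it to TTS one sentence at a time:

import pyaudio
from openai import OpenAI

client = OpenAI()

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24_000, output=True)

def speak(text: str):
    # Stream one chunk of text straight through TTS to the sound card.
    with client.audio.speech.with_streaming_response.create(
            model="tts-1", voice="alloy", input=text, response_format="pcm"
    ) as response:
        for chunk in response.iter_bytes(1024):
            stream.write(chunk)

buffer = ""
completion = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any chat model that supports streaming
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for event in completion:
    buffer += event.choices[0].delta.content or ""
    # Flush to TTS whenever the buffer ends with sentence punctuation.
    if buffer.rstrip().endswith((".", "!", "?")):
        speak(buffer)
        buffer = ""
if buffer.strip():
    speak(buffer)  # say whatever is left over

Each flush opens a new TTS request, so there is still a per-sentence round trip, but speech can start after the first sentence instead of after the whole completion.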

Haven’t tested, but looks promising! Wonder whether with_streaming_response.create has been available since this thread started?

True, but that wasn’t part of my original question.

Talking about realtime etc., join the crowd at GitHub - pipecat-ai/pipecat: Open Source framework for voice and multimodal conversational AI
I think it’s a brilliant framework that deals with the intricate details of transporting audio (and video) data across the interwebs as fast as possible. It’s in very early stages and contributors are welcome!

It has been documented at least since March 3rd, see this post.

Pipecat looks cool btw!


If your input is a string of words, then it might make sense to split by sentence. That’ll generate more coherent speech than sending individual words/chunks to TTS. A sketch of a splitter follows below.
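For instance, a deliberately naive splitter (a sketch only; real text needs smarter handling of abbreviations, decimals, quotes, etc.) might look like:

import re

def split_sentences(text: str) -> list[str]:
    # Split after sentence-ending punctuation followed by whitespace.
    # Naive on purpose: "Dr. Smith" or "3.14" will be split wrongly.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

for sentence in split_sentences("Hello there! How are you? I am fine."):
    print(sentence)  # in practice, send each sentence to the TTS endpoint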

One of the examples in the docs shows streaming for the TTS endpoint like this:

from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
)

response.stream_to_file("output.mp3")

So in the end it still streams to a file, but I’d like to access the data chunks as they come in, in code, to send them over a websocket.
How should the response from the API be handled in Node.js to achieve this?
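For reference, in Python the with_streaming_response variant shown earlier gives exactly this chunk-level access; here is an untested sketch relaying each chunk over a websocket (it assumes the third-party websockets package and a socket server already listening at the given URL). In Node, client.audio.speech.create() returns a fetch-style Response, so response.body should give you a readable stream to pipe into a socket the same way, though I haven't verified that here.

import asyncio

import websockets  # third-party package, assumed installed
from openai import OpenAI

client = OpenAI()

async def stream_tts_to_socket(text: str, url: str = "ws://localhost:8765"):
    async with websockets.connect(url) as ws:
        with client.audio.speech.with_streaming_response.create(
                model="tts-1",
                voice="alloy",
                input=text,
                response_format="pcm",
        ) as response:
            # Note: this blocking iteration stalls the event loop between
            # sends; fine for a demo, use a thread/executor in production.
            for chunk in response.iter_bytes(4096):
                await ws.send(chunk)  # one binary frame per audio chunk

asyncio.run(stream_tts_to_socket("Hello world! This is a streaming test."))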

Here is a complete script for anyone who needs a standalone test. You can adjust the buffer size on it, etc.

import os
import requests
import openai
from pydub import AudioSegment
from pydub.playback import play
import io
from dotenv import load_dotenv
from queue import Queue
import threading

# Load environment variables from .env file

load_dotenv()

# Set OpenAI API key

openai.api_key = os.getenv('OPENAI_API_KEY')

def stream_audio_from_openai(api_url, params, headers, audio_queue):
    response = requests.post(api_url, json=params, headers=headers, stream=True)
    response.raise_for_status()

    try:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                audio_queue.put(chunk)
    except Exception as e:
        print(f"Failed to stream audio: {e}")
    finally:
        audio_queue.put(None)  # Sentinel to indicate the end of the stream

def play_audio_from_queue(audio_queue, initial_buffer_size=32768):
    audio_buffer = io.BytesIO()
    buffer_size = 0
    playback_started = False

    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        if chunk:
            audio_buffer.write(chunk)
            buffer_size += len(chunk)

        if buffer_size >= initial_buffer_size and not playback_started:
            playback_started = True
            audio_buffer.seek(0)
            try:
                audio_segment = AudioSegment.from_file(audio_buffer, format="mp3")
                play(audio_segment)
                audio_buffer.seek(0)
                audio_buffer.truncate(0)
                buffer_size = 0
            except Exception as e:
                print(f"Error processing initial buffer: {e}")
                audio_buffer.seek(0)
                audio_buffer.truncate(0)
                buffer_size = 0

    # Play any remaining buffered audio
    if buffer_size > 0:
        audio_buffer.seek(0)
        try:
            audio_segment = AudioSegment.from_file(audio_buffer, format="mp3")
            play(audio_segment)
        except Exception as e:
            print(f"Error processing remaining buffer: {e}")

def generate_and_play_speech(text):
    api_url = "https://api.openai.com/v1/audio/speech"
    params = {
        "model": "tts-1",
        "voice": "alloy",
        "input": text
    }
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json"
    }

    audio_queue = Queue()
    stream_thread = threading.Thread(target=stream_audio_from_openai, args=(api_url, params, headers, audio_queue))
    play_thread = threading.Thread(target=play_audio_from_queue, args=(audio_queue, 32768))  # Adjust buffer size here

    stream_thread.start()
    play_thread.start()

    stream_thread.join()
    play_thread.join()  # the streamer's sentinel lets this thread exit

# Example usage
generate_and_play_speech("Hello world! This is a streaming test.")


On a Raspberry Pi 4 using the latest Raspbian: why would the following code not say each number properly?

import pyaudio
from openai import OpenAI

client = OpenAI()

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,  # 16-bit PCM (the constant's value is 8)
                channels=1,
                rate=24_000,
                output=True)

with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input="""Uno, dos, tres, cuatro, cinco, seis, siete, ocho, nueve, diez.""",
        response_format="pcm"
) as response:
    for chunk in response.iter_bytes(1024):
        stream.write(chunk)

I used this for my project and it works perfectly! It speaks seamlessly. Thank you!

Streaming audio was a nightmare for me to implement.

I wrote about it in this article: https://medium.com/@aleksmilanov/the-ai-awakening-ab87546abd06

Happy to provide more detail / code.

If you are not running on a smart TV then it will be a lot easier to implement!


Does the .NET SDK have streaming capability for the AudioClient?

So I use Express for my backend; how would one implement this with WebSockets?