Streaming from the Text-to-Speech API

Same here: the only formats that worked were AAC and MP3, but I couldn’t make them work with MediaSource on Firefox. These formats only work on Chrome/Edge.

Any solutions to that? Making Opus work would be helpful…

For me, it plays well from my Vue client, which uses an HTML audio element. Also, in case you have an Express server: I just tried this with a fairly lengthy bedtime story, and it played almost instantly, so I believe streaming is working:

const generateOpenAIAudio = async (text, req, res) => {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: "nova",
    input: text,
    format: "opus",
  });

  console.log("generating streaming audio for: ", text);

  res.writeHead(200, {
    "Content-Type": "audio/ogg",
    "Transfer-Encoding": "chunked",
  });

  const readableStream = response.body;

  // Pipe the readable stream to the response
  readableStream.pipe(res);


  readableStream.on("end", () => {
    console.log(`Stream ended.`);
    res.end();
  });

  readableStream.on("error", (e) => {
    res.end();
    console.error("Error streaming TTS:", e);
  });
};

Does anyone know how to make it stream and play in Python? I can stream the incoming bytes, but I don’t know of a Python library that can play streamed audio unless it is PCM data. Support for MP3 and Opus seems to require a physical file, and I don’t think I can write MP3 chunks into files and play them separately.

I was able to get some preliminary results with streaming + playing back audio in real-time using pyaudio:

import os
import requests
from time import time
import pyaudio

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": f'Bearer {os.getenv("OPENAI_API_KEY")}',
}

data = {
    "model": "tts-1",
    "input": "This is a test",
    "voice": "shimmer",
    "response_format": "wav",
}

start_time = time()
response = requests.post(url, headers=headers, json=data, stream=True)
if response.status_code == 200:
    print(f"Time to first byte: {int((time() - start_time) * 1000)} ms")
    p = pyaudio.PyAudio()
    # format=8 is pyaudio.paInt16; the WAV payload is 16-bit mono PCM at 24 kHz
    stream = p.open(format=8, channels=1, rate=24000, output=True)
    for chunk in response.iter_content(chunk_size=1024):
        stream.write(chunk)
    print(f"Time to complete: {int((time() - start_time) * 1000)} ms")

HEADPHONE WARNING: this can cause very harsh noise, especially on longer inputs. Keep volume low.

The best part is that this drastically reduces latency to about 200-500 ms (time to first byte). I found latency was lowest with "response_format": "wav", though the trade-off is larger file size. But on any decent connection, the bottleneck will still be generation speed, not network.

I got the values for format, channels and rate by writing the stream to a .wav file and analyzing it as per pyaudio docs:

import os
import requests
import io
import wave
import pyaudio

# url = ...
# headers = ...
# data = ...

response = requests.post(url, headers=headers, json=data, stream=True)
if response.status_code == 200:
    buffer = io.BytesIO()
    for chunk in response.iter_content(chunk_size=1024):
        buffer.write(chunk)

with open("speech.wav", "wb") as f:
    f.write(buffer.getvalue())

with wave.open('speech.wav', 'rb') as wf:
    p = pyaudio.PyAudio()

    print('format', p.get_format_from_width(wf.getsampwidth()))
    print('channels', wf.getnchannels())
    print('rate', wf.getframerate())

But my guess is that this is wrong, and is the cause of the intermittent noise.

response = requests.post(url, headers=headers, json=data, stream=True)
if response.status_code == 200:
    print(f"Time to first byte: {int((time() - start_time) * 1000)} ms")
    p = pyaudio.PyAudio()
    stream = p.open(format=8, channels=1, rate=24000, output=True)
    for chunk in response.iter_content(chunk_size=1024):
        stream.write(chunk)

I tried this out (on Linux).

This gives an audible pop at the beginning of playback. I found you can avoid the pop by skipping the (non-audio data) WAV header at the beginning of the response (it seems like they send the header alone as the first chunk, 44 bytes long).
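
In code, the skip looks roughly like this (just a sketch, reusing the p and response variables from the snippet above and assuming the header really does arrive as its own first chunk):

stream = p.open(format=8, channels=1, rate=24000, output=True)
is_first_chunk = True
for chunk in response.iter_content(chunk_size=1024):
    if is_first_chunk:
        # drop the 44-byte WAV header so only raw PCM samples reach pyaudio
        is_first_chunk = False
        continue
    stream.write(chunk)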

Now, I don’t know why longer text inputs yield such horrible audio noise. The weird thing is that if you grab the whole response and write it to a wav file, it plays just fine. But of course that defeats the purpose. I would really like to figure this out, but I’m struggling with it. Please let me know if you make any progress.

Edit: There’s a correlation between odd chunk lengths and the noise issue, at least for me. Shorter text inputs always get the 44-byte first chunk (WAV header) and then 1024-byte chunks until the last one. But with longer text inputs, sooner or later a chunk arrives that isn’t a whole 1024 bytes, and at that moment the audio goes awry. Still not sure what to do about it…

Hey, that’s cool!

I had the pop at the start too, I just ignored it at first but you make a good point with the header. I can confirm that skipping the first chunk fixes it.

I think it’s normal for chunks to sometimes be different sizes, probably just a networking thing. Quick thought: we could have a buffer that waits until it contains at least 1024 bytes before feeding a chunk into pyaudio? I’ll see if I can make it work later.

I’m on macOS, btw.

Quick thought: we could have a buffer that waits until it contains at least 1024 bytes before feeding a chunk into pyaudio?

Holy crap that worked! I went with 1024 exactly, and it seems like that fixed it. That’s awesome. Ok I can go to bed now.

Good thinking!

@aaronepperly Awesome! If you want to share your code that would be very much appreciated!

@nimobeeren I hesitated to post my code because it is fairly ugly at the moment, but here it is:

my_buffer = bytes()                 # "my_buffer" gets loaded up from the http stream
my_1024 = bytes()                    # when "my_buffer" has enough, it gets sliced off into "my_1024"

if response.status_code == 200:
	is_first_chunk = True
	stream = p.open(format=8, channels=1, rate=24000, output=True)
	for chunk in response.iter_content(chunk_size=1024):
		if is_first_chunk:                                     # skip the header
			is_first_chunk = False
			continue
		my_buffer += chunk
		if len(my_buffer) >= 1024:
			my_1024 = my_buffer[0:1024]
			my_buffer = my_buffer[1024:]
		if len(my_1024):
			stream.write(my_1024)
			my_1024 = bytes()
	if len(my_buffer):                  # whatever is left in my_buffer, which will
		stream.write(my_buffer)         # likely be less than 1024 bytes long
	stream.close()

On another note, concerning the arguments to the call that opens the PyAudio stream… I think it would be wise of us to replace:

format=8, channels=1, rate=24000

with values we read out of the WAV header in OpenAI’s HTTP response (that first 44-byte chunk), to make our code more robust against changes OpenAI could make in the future.
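
Something along these lines could work (just a sketch; parse_wav_header and header_chunk are made-up names, and the offsets assume the standard 44-byte PCM RIFF layout):

import struct

def parse_wav_header(header: bytes):
    # Standard PCM WAV header, little-endian fields
    channels = struct.unpack_from("<H", header, 22)[0]         # offset 22: channel count
    sample_rate = struct.unpack_from("<I", header, 24)[0]      # offset 24: sample rate
    bits_per_sample = struct.unpack_from("<H", header, 34)[0]  # offset 34: bits per sample
    return channels, sample_rate, bits_per_sample

# header_chunk = the first (44-byte) chunk received from the response
channels, sample_rate, bits = parse_wav_header(header_chunk)
stream = p.open(format=p.get_format_from_width(bits // 8),
                channels=channels,
                rate=sample_rate,
                output=True)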

I finally had some time to come back to this and found a pretty simple solution. It’s possible to directly pass the response.raw stream into the wave.open call, which automatically deals with parsing the header and buffering chunks. I’m getting full playback without noise with this:

import os
from time import sleep
import wave
import requests
import pyaudio


# url = ...
# headers = ...
# data = ...

response = requests.post('https://api.openai.com/v1/audio/speech', headers=headers, json=data, stream=True)

CHUNK_SIZE = 1024

if response.ok:
    with wave.open(response.raw, 'rb') as wf:
        p = pyaudio.PyAudio()
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)

        while len(data := wf.readframes(CHUNK_SIZE)): 
            stream.write(data)

        # Sleep to make sure playback has finished before closing
        sleep(1)
        stream.close()
        p.terminate()
else:
    response.raise_for_status()

So basically just the example from PyAudio docs with the response.raw stream plugged in.

I did notice the sound cutting off a bit before the end sometimes, which is why I added the sleep(1). A bit strange as this wasn’t happening before, but it works.

@nimobeeren

if response.ok:
    with wave.open(response.raw, 'rb') as wf:
        p = pyaudio.PyAudio()
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)

        while len(data := wf.readframes(CHUNK_SIZE)): 
            stream.write(data)

That looks really clean!

If you wouldn’t mind, could you please elaborate on how you’re playing the audio on the front end? I’m not having any luck with it, but my backend is definitely sending it with chunked encoding.

Sure, so there are two parts:

Server side is a REST API with Node.js and Express, something like this:

app.get("/api/stream", async (req, res) => {
  const { text, voice } = req.query; // Assuming the text for TTS is passed as a query parameter
generateOpenAIAudio(text, voice, req, res);
});

And actually, I’ve just checked: currently I’m using these parameters in the generateOpenAIAudio generation function:

...
const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: voice,
    input: text,
    format: "mp3",
    speed: 1.1,
  });

...

res.writeHead(200, {
    "Content-Type": "audio/mpeg",
  });

Client side is a Vue.js client app, where I have something like this inside a component:

template part:
<audio ref="audioPlayer" crossorigin="anonymous"></audio>

script part:

startAudioStream(text, voice) {
      const streamUrl = `http://localhost:3000/api/stream?voice=${voice}&text=${encodeURIComponent(
        text
      )}`;
      this.playAudioStream(streamUrl);
    },

playAudioStream(streamUrl) {

      // Reference the audio player element.
      const audio = this.$refs.audioPlayer;
      audio.src = streamUrl;

      if (!this.audioContext) {
        // Initialize the AudioContext only once
        this.audioContext = new (window.AudioContext ||
          window.webkitAudioContext)();

        // Create the MediaElementSource node only once
        this.source = this.audioContext.createMediaElementSource(audio);
      }

      // Start playback; the element begins playing as soon as enough data has buffered,
      // so the audio is likely to play through without interruption
      audio
        .play()
        .then(() => {
          console.log("Audio playing...");
        })
        .catch((err) => {
          console.error("Error playing audio:", err);
        });

      // You can also add an 'ended' event listener to do something once playback has ended
      audio.onended = () => {
        console.log("Audio ended.");
        ...
      };
    },

I hope it helps.

Thank you! Yeah, I was attempting to use a streamed response to make it feel more real-time, but it seems like some browsers don’t play nice with it quite yet.

I was attempting to use the MediaSource API to append the chunks as they came in from the server, but it has been anything but straightforward to get working.

Note this GitHub issue which shows an example for streaming TTS output to speakers using the openai Python library.
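
A sketch along those lines (hedged: this assumes a 1.x openai-python client that exposes with_streaming_response, and the pcm response format, which is raw 16-bit, 24 kHz mono with no header):

import pyaudio
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="This is a streaming test",
    response_format="pcm",  # raw samples, so there is no header to skip
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        stream.write(chunk)

stream.stop_stream()
stream.close()
p.terminate()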

Because I don’t see “stream: true” in your request, can you please explain how this code actually streams the response rather than autoplaying once the entire response is ready?

Yes, I’ve found that stream: true doesn’t make a difference in this case. If you add a console.log to report when the streaming ends, you’ll see that the audio starts way before the whole response arrives from the API call. At least that’s my experience.

I haven’t tried Greg’s code, but essentially any request to fetch some binary data is “streamed” in the sense that the data is split into chunks which are sent sequentially to the client. The HTML audio element is probably set up to work with those streams by playing the audio as soon as the first chunks start coming in. What we don’t know is whether OpenAI actually starts sending chunks before the entire audio is generated on their end. But I suspect they do, otherwise we’d see longer time-to-first-chunk on longer input text, but that hasn’t been the case in my experiments.
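
One way to check that is to measure time-to-first-chunk for inputs of different lengths (a sketch of that experiment; the word counts are arbitrary):

import os
import requests
from time import time

url = "https://api.openai.com/v1/audio/speech"
headers = {"Authorization": f'Bearer {os.getenv("OPENAI_API_KEY")}'}

for words in (10, 100, 500):
    data = {
        "model": "tts-1",
        "voice": "shimmer",
        "response_format": "wav",
        "input": "hello " * words,
    }
    start = time()
    with requests.post(url, headers=headers, json=data, stream=True) as response:
        next(response.iter_content(chunk_size=1024))  # block until the first chunk arrives
        print(f"{words} words -> first chunk after {int((time() - start) * 1000)} ms")

If time-to-first-chunk stays roughly flat as the input grows, chunks are being sent before generation finishes.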

Yup, I can verify (JavaScript only) that by setting stream: true the response is chunked, and thus the delay is reduced significantly. However, because these chunks experience delay variation as they travel the network, you cannot just play them back by pushing them into the audio buffer as they arrive. You need to build up a small buffer before playback in order to smooth out the delay variation.
Of course, the small buffer adds a bit of latency, but for long audio responses it clearly improves the user experience.

Could you share the code for how you’ve managed to make it work?
