Streaming from the Text-to-Speech API

Same here: the only formats that worked were AAC and MP3, but I couldn’t make them work with MediaSource on Firefox. These formats only work on Chrome/Edge.

Any solutions to that? Making Opus work would be helpful…

For me, it plays well from my Vue client, which uses an HTML audio element. Also, in case you have an Express server: I just tried this with a fairly lengthy bedtime story, and it played almost instantly, so I believe streaming is working:

const generateOpenAIAudio = async (text, req, res) => {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: "nova",
    input: text,
    format: "opus",
  });

  console.log("generating streaming audio for: ", text);

  res.writeHead(200, {
    "Content-Type": "audio/ogg",
    "Transfer-Encoding": "chunked",
  });

  const readableStream = response.body;

  // Pipe the readable stream to the response
  readableStream.pipe(res);


  readableStream.on("end", () => {
    console.log(`Stream ended.`);
    res.end();
  });

  readableStream.on("error", (e) => {
    res.end();
    console.error("Error streaming TTS:", e);
  });
};

Does anyone know how to make it stream and play in Python? I can stream the incoming bytes, but I don’t know of a Python library that can play streamed audio unless it is PCM data. Support for MP3 and Opus seems to require a physical file, and I don’t think I can write MP3 chunks into files and play them separately.

I was able to get some preliminary results with streaming + playing back audio in real-time using pyaudio:

import os
import requests
from time import time
import pyaudio

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": f'Bearer {os.getenv("OPENAI_API_KEY")}',
}

data = {
    "model": "tts-1",
    "input": "This is a test",
    "voice": "shimmer",
    "response_format": "wav",
}

start_time = time()
response = requests.post(url, headers=headers, json=data, stream=True)
if response.status_code == 200:
    print(f"Time to first byte: {int((time() - start_time) * 1000)} ms")
    p = pyaudio.PyAudio()
    # format=8 is pyaudio.paInt16; the WAV payload is 16-bit mono PCM at 24 kHz
    stream = p.open(format=8, channels=1, rate=24000, output=True)
    for chunk in response.iter_content(chunk_size=1024):
        stream.write(chunk)
    print(f"Time to complete: {int((time() - start_time) * 1000)} ms")

HEADPHONE WARNING: this can cause very harsh noise, especially on longer inputs. Keep volume low.

The best part is that this drastically reduces latency to about 200-500 ms (time to first byte). I found latency was lowest with "response_format": "wav", though the trade-off is larger file size. But on any decent connection, the bottleneck will still be generation speed, not network.

I got the values for format, channels and rate by writing the stream to a .wav file and analyzing it as per pyaudio docs:

import os
import requests
import io
import wave
import pyaudio

# url = ...
# headers = ...
# data = ...

response = requests.post(url, headers=headers, json=data, stream=True)
if response.status_code == 200:
    buffer = io.BytesIO()
    for chunk in response.iter_content(chunk_size=1024):
        buffer.write(chunk)

with open("speech.wav", "wb") as f:
    f.write(buffer.getvalue())

with wave.open('speech.wav', 'rb') as wf:
    p = pyaudio.PyAudio()

    print('format', p.get_format_from_width(wf.getsampwidth()))
    print('channels', wf.getnchannels())
    print('rate', wf.getframerate())

But my guess is that this is wrong, and is the cause of the intermittent noise.

response = requests.post(url, headers=headers, json=data, stream=True)
if response.status_code == 200:
    print(f"Time to first byte: {int((time() - start_time) * 1000)} ms")
    p = pyaudio.PyAudio()
    stream = p.open(format=8, channels=1, rate=24000, output=True)
    for chunk in response.iter_content(chunk_size=1024):
        stream.write(chunk)

I tried this out (on Linux).

This gives an audible pop at the beginning of playback. I found you can avoid the pop by skipping the (non-audio data) WAV header at the beginning of the response (it seems like they send the header alone as the first chunk, 44 bytes long).
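
In code, the skip looks roughly like this (just a sketch, reusing the p and response variables from the snippet above and assuming the header really does arrive as its own first chunk):

stream = p.open(format=8, channels=1, rate=24000, output=True)
is_first_chunk = True
for chunk in response.iter_content(chunk_size=1024):
    if is_first_chunk:
        # drop the 44-byte WAV header so only raw PCM samples reach pyaudio
        is_first_chunk = False
        continue
    stream.write(chunk)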

Now, I don’t know why longer text inputs yield such horrible audio noise. The weird thing is that if you grab the whole response and write it to a wav file, it plays just fine. But of course that defeats the purpose. I would really like to figure this out, but I’m struggling with it. Please let me know if you make any progress.

Edit: There’s a correlation between odd chunk lengths and the noise issue, at least for me. Shorter text inputs always get the 44-byte first chunk (WAV header) and then 1024-byte chunks until the last one. But with longer text inputs, sooner or later a chunk arrives that isn’t a whole 1024 bytes, and at that moment the audio goes awry. Still not sure what to do about it…

Hey, that’s cool!

I had the pop at the start too, I just ignored it at first but you make a good point with the header. I can confirm that skipping the first chunk fixes it.

I think it’s normal for chunks to sometimes be different sizes, probably just a networking thing. Quick thought: we could have a buffer that waits until it contains at least 1024 bytes before feeding a chunk into pyaudio? I’ll see if I can make it work later.

I’m on macOS, btw.

Quick thought: we could have a buffer that waits until it contains at least 1024 bytes before feeding a chunk into pyaudio?

Holy crap that worked! I went with 1024 exactly, and it seems like that fixed it. That’s awesome. Ok I can go to bed now.

Good thinking!

@aaronepperly Awesome! If you want to share your code that would be very much appreciated!

@nimobeeren I hesitated to post my code because it is fairly ugly at the moment, but here it is:

my_buffer = bytes()                 # "my_buffer" gets loaded up from the http stream
my_1024 = bytes()                    # when "my_buffer" has enough, it gets sliced off into "my_1024"

if response.status_code == 200:
	is_first_chunk = True
	stream = p.open(format=8, channels=1, rate=24000, output=True)
	for chunk in response.iter_content(chunk_size=1024):
		if is_first_chunk:                                     # skip the header
			is_first_chunk = False
			continue
		my_buffer += chunk
		if len(my_buffer) >= 1024:
			my_1024 = my_buffer[0:1024]
			my_buffer = my_buffer[1024:]
		if len(my_1024):
			stream.write(my_1024)
			my_1024 = bytes()
	if len(my_buffer):                  # whatever is left in my_buffer, which will
		stream.write(my_buffer)         # likely be less than 1024 bytes long
	stream.close()

On another note, concerning the arguments to the call that opens the PyAudio stream… I think it would be wise of us to replace:

format=8, channels=1, rate=24000

with values we read out of the WAV header in OpenAI’s HTTP response (that first 44-byte chunk), to make our code more robust against changes OpenAI could make in the future.
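
Something along these lines could work (just a sketch; parse_wav_header and header_chunk are made-up names, and the offsets assume the standard 44-byte PCM RIFF layout):

import struct

def parse_wav_header(header: bytes):
    # Standard PCM WAV header, little-endian fields
    channels = struct.unpack_from("<H", header, 22)[0]         # offset 22: channel count
    sample_rate = struct.unpack_from("<I", header, 24)[0]      # offset 24: sample rate
    bits_per_sample = struct.unpack_from("<H", header, 34)[0]  # offset 34: bits per sample
    return channels, sample_rate, bits_per_sample

# header_chunk = the first (44-byte) chunk received from the response
channels, sample_rate, bits = parse_wav_header(header_chunk)
stream = p.open(format=p.get_format_from_width(bits // 8),
                channels=channels,
                rate=sample_rate,
                output=True)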

I finally had some time to come back to this and found a pretty simple solution. It’s possible to directly pass the response.raw stream into the wave.open call, which automatically deals with parsing the header and buffering chunks. I’m getting full playback without noise with this:

import os
from time import sleep
import wave
import requests
import pyaudio


# url = ...
# headers = ...
# data = ...

response = requests.post('https://api.openai.com/v1/audio/speech', headers=headers, json=data, stream=True)

CHUNK_SIZE = 1024

if response.ok:
    with wave.open(response.raw, 'rb') as wf:
        p = pyaudio.PyAudio()
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)

        while len(data := wf.readframes(CHUNK_SIZE)): 
            stream.write(data)

        # Sleep to make sure playback has finished before closing
        sleep(1)
        stream.close()
        p.terminate()
else:
    response.raise_for_status()

So basically just the example from PyAudio docs with the response.raw stream plugged in.

I did notice the sound cutting off a bit before the end sometimes, which is why I added the sleep(1). A bit strange as this wasn’t happening before, but it works.

@nimobeeren

if response.ok:
    with wave.open(response.raw, 'rb') as wf:
        p = pyaudio.PyAudio()
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)

        while len(data := wf.readframes(CHUNK_SIZE)): 
            stream.write(data)

That looks really clean!

If you wouldn’t mind, could you please elaborate on how you’re playing the audio on the front end? I’m not having any luck with it, but my backend is definitely sending it with chunked encoding.

Sure, so there are two parts:

Server side is a REST API with Node.js and Express, something like this:

app.get("/api/stream", async (req, res) => {
  const { text, voice } = req.query; // Assuming the text for TTS is passed as a query parameter
generateOpenAIAudio(text, voice, req, res);
});

And actually, I’ve just checked: currently I’m using these parameters in the generateOpenAIAudio generation function:

...
const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: voice,
    input: text,
    format: "mp3",
    speed: 1.1,
  });

...

res.writeHead(200, {
    "Content-Type": "audio/mpeg",
  });

Client side is a Vue.js client app, where I have something like this inside a component:

template part:
<audio ref="audioPlayer" crossorigin="anonymous"></audio>

script part:

startAudioStream(text, voice) {
      const streamUrl = `http://localhost:3000/api/stream?voice=${voice}&text=${encodeURIComponent(
        text
      )}`;
      this.playAudioStream(streamUrl);
    },

playAudioStream(streamUrl) {

      // Reference the audio player element.
      const audio = this.$refs.audioPlayer;
      audio.src = streamUrl;

      if (!this.audioContext) {
        // Initialize the AudioContext only once
        this.audioContext = new (window.AudioContext ||
          window.webkitAudioContext)();

        // Create the MediaElementSource node only once
        this.source = this.audioContext.createMediaElementSource(audio);
      }

      // Start playback; the element begins playing as soon as enough data has buffered,
      // so the audio is likely to play through without interruption
      audio
        .play()
        .then(() => {
          console.log("Audio playing...");
        })
        .catch((err) => {
          console.error("Error playing audio:", err);
        });

      // You can also add an 'ended' event listener to do something once playback has ended
      audio.onended = () => {
        console.log("Audio ended.");
        ...
      };
    },

I hope it helps.

Thank you! Yeah, I was attempting to use a streamed response to make it feel more real-time, but it seems like some browsers don’t play nice with it quite yet.

I was attempting to use the MediaSource API to append the chunks as they came in from the server, but it has been anything but straightforward to get working.

Note this GitHub issue which shows an example for streaming TTS output to speakers using the openai Python library.
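
A sketch along those lines (hedged: this assumes a 1.x openai-python client that exposes with_streaming_response, and the pcm response format, which is raw 16-bit, 24 kHz mono with no header):

import pyaudio
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="This is a streaming test",
    response_format="pcm",  # raw samples, so there is no header to skip
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        stream.write(chunk)

stream.stop_stream()
stream.close()
p.terminate()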

Because I don’t see “stream: true” in your request, can you please explain how this code actually streams the response rather than autoplaying once the entire response is ready?

Yes, I’ve found that stream: true doesn’t make a difference in this case. If you add a console.log to report when the streaming ends, you’ll see that the audio starts way before the whole response arrives from the API call. At least that’s my experience.

I haven’t tried Greg’s code, but essentially any request to fetch some binary data is “streamed” in the sense that the data is split into chunks which are sent sequentially to the client. The HTML audio element is probably set up to work with those streams by playing the audio as soon as the first chunks start coming in. What we don’t know is whether OpenAI actually starts sending chunks before the entire audio is generated on their end. But I suspect they do, otherwise we’d see longer time-to-first-chunk on longer input text, but that hasn’t been the case in my experiments.
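
One way to check that is to measure time-to-first-chunk for inputs of different lengths (a sketch of that experiment; the word counts are arbitrary):

import os
import requests
from time import time

url = "https://api.openai.com/v1/audio/speech"
headers = {"Authorization": f'Bearer {os.getenv("OPENAI_API_KEY")}'}

for words in (10, 100, 500):
    data = {
        "model": "tts-1",
        "voice": "shimmer",
        "response_format": "wav",
        "input": "hello " * words,
    }
    start = time()
    with requests.post(url, headers=headers, json=data, stream=True) as response:
        next(response.iter_content(chunk_size=1024))  # block until the first chunk arrives
        print(f"{words} words -> first chunk after {int((time() - start) * 1000)} ms")

If time-to-first-chunk stays roughly flat as the input grows, chunks are being sent before generation finishes.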

Yup, I can verify (JavaScript only) that by setting stream: true the response is chunked, and thus the delay is reduced significantly. However, because these chunks experience delay variation as they travel the network, you cannot just play them back by pushing them into the audio buffer as they arrive. You need to build up a small buffer before playback in order to smooth out the delay variation.
Of course, the small buffer adds a bit of latency, but for long audio responses it clearly improves the user experience.

Could you share the code for how you’ve managed to make it work?
