Streaming from Text-to-Speech API

The API documentation reads:

The Speech API provides support for real time audio streaming using chunk transfer encoding. This means that the audio is able to be played before the full file has been generated and made accessible.

I have been trying for hours to play back chunks of the openai.audio.speech.create() response through a sounddevice output stream. I am trying something along the lines of:

response = openai.audio.speech.create(input="...") # shortened for brevity

samplerate = 24000.0 # Got this from `data, fs = sf.read([the whole file])`
channels = 2
blocksize = 1024
stream = sd.OutputStream(
        device=1,
        samplerate=samplerate,
        channels=channels,
        dtype="float32",
        prime_output_buffers_using_stream_callback=False, # I found that in the sd.play() method which does play the sound nicely once entirely written to file
    )
read_size = blocksize * channels * stream.samplesize
with stream:
    # This is what stream_to_file() would be doing
    for chunk in response.iter_bytes(chunk_size=read_size):
       data, samplerate = sf.read(io.BytesIO(chunk), dtype="float32")
       stream.write(data)

Depending on the response_format I choose, be it mp3, opus or flac, I get different errors, either from sf.read() or from stream.write().

I am a total noob when it comes to audio formats and handling. Am I on the right track at all?

Thanks!
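
One likely reason for those errors, offered as an illustration rather than a diagnosis of the exact tracebacks: each chunk from iter_bytes() is an arbitrary slice of one encoded file, so only the slice that contains the file header looks like a file to a decoder, while sf.read() is asked to open every slice as if it were complete. The sketch below shows the same effect with the standard-library wave module and a synthetic WAV file. Both are stand-ins chosen so the example runs without an API key; decodes_as_file() is a name made up for this example.

```python
import io
import wave

# Build one second of silence as a WAV file in memory (24 kHz, 16-bit, mono).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24000)
    wav.writeframes(b"\x00\x00" * 24000)
whole_file = buffer.getvalue()

# Slice the file into fixed-size pieces, the way a chunked download would.
chunk_size = 4096
chunks = [whole_file[i:i + chunk_size] for i in range(0, len(whole_file), chunk_size)]


def decodes_as_file(data: bytes) -> bool:
    """Return True if `data` can be opened as a standalone WAV file."""
    try:
        with wave.open(io.BytesIO(data), "rb"):
            return True
    except (wave.Error, EOFError):
        return False


print(decodes_as_file(whole_file))  # True: the complete file starts with its header
print(decodes_as_file(chunks[1]))   # False: a later chunk is sample data only
```

If this is what is happening, the chunks have to go through one decoder that keeps state across them, or the audio has to be requested in a headerless format that can be written to the output stream as it arrives.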

Any update on this? I’m trying to do the same with Node.js.

Hey @cesarbaudi1 @TomTom101 , just posted a thread on X that might help. Basically make a call to the speech endpoint and use pyaudio to chunk and play as it streams.
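
A minimal sketch of that approach, not the code from the linked thread. It assumes the endpoint accepts response_format="pcm" and returns headerless 16-bit mono samples at 24 kHz (the 24000 matches the sample rate reported in the first post; the rest is an assumption to check against the current API docs). whole_samples() and speak() are names made up for this example. Network chunks can end in the middle of a sample, so the helper carries the odd byte over to the next chunk before anything is written to the device.

```python
from typing import Iterable, Iterator

API_URL = "https://api.openai.com/v1/audio/speech"
SAMPLE_RATE = 24000  # assumed sample rate of the "pcm" response format
SAMPLE_WIDTH = 2     # assumed 16-bit mono samples, so 2 bytes per frame


def whole_samples(chunks: Iterable[bytes], width: int = SAMPLE_WIDTH) -> Iterator[bytes]:
    """Re-slice network chunks so every yielded block ends on a sample boundary."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        usable = len(pending) - len(pending) % width
        if usable:
            yield pending[:usable]
            pending = pending[usable:]
    # A trailing partial sample cannot be played, so it is dropped.


def speak(text: str, api_key: str, model: str = "tts-1", voice: str = "alloy") -> None:
    """Play the speech while it is still downloading."""
    # Third-party packages, imported here so whole_samples() has no dependencies.
    import pyaudio
    import requests

    payload = {"model": model, "voice": voice, "input": text, "response_format": "pcm"}
    headers = {"Authorization": f"Bearer {api_key}"}
    audio = pyaudio.PyAudio()
    out = audio.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE, output=True)
    try:
        with requests.post(API_URL, headers=headers, json=payload, stream=True) as response:
            response.raise_for_status()
            for block in whole_samples(response.iter_content(chunk_size=1024)):
                out.write(block)  # blocks until the device has accepted the samples
    finally:
        out.stop_stream()
        out.close()
        audio.terminate()
```

Calling speak("Hello there", api_key) should start playback after the first blocks arrive, if the assumptions about the format hold.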

Thanks @gonzalo!

with requests.post(url, headers=headers, json=data, stream=True) as response:

So no luck using the API directly? You had to make a regular POST request?

There are hints in the openai package code that it at least CAN do stream requests:

    def post(
        self,
        path: str,
        *,
        cast_to: Type[ResponseT],
        body: Body | None = None,
        options: RequestOptions = {},
        files: RequestFiles | None = None,
        stream: Literal[True],
        stream_cls: type[_StreamT],
    ) -> _Str:
    ....

Have you verified the API does not send stream requests? At least it should be doable.

Managed to get streamed audio working using the regular Python SDK:

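import io

import sounddevice as sd
import soundfile as sf
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `response` holds the text to speak and is defined elsewhere
spoken_response = client.audio.speech.create(
    model="tts-1-hd",
    voice="fable",
    response_format="opus",
    input=response,
)

# Collect every chunk of the response into an in-memory file
buffer = io.BytesIO()
for chunk in spoken_response.iter_bytes(chunk_size=4096):
    buffer.write(chunk)
buffer.seek(0)

# Decode the complete file, then play it and block until playback ends
with sf.SoundFile(buffer, 'r') as sound_file:
    data = sound_file.read(dtype='int16')
    sd.play(data, sound_file.samplerate)
    sd.wait()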

@tonycamonte are you actually getting streaming audio, though?

From what I can tell, spoken_response holds the full audio output, generated all in one go. I have a similar script, and it’ll play, sure, but put a print statement before the buffer is created and give it a good chunk of text, and you’ll see that it’s 30 seconds of processing before it even tries to assign a value to the buffer.
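
The print-statement check can be made a bit more precise by timing when each chunk arrives, counted from just before the request is made. A minimal sketch: arrival_times() is a name made up here, and it takes a callable so that the time spent inside create() itself is included in the measurement.

```python
import time
from typing import Callable, Iterable, List


def arrival_times(make_chunks: Callable[[], Iterable[bytes]]) -> List[float]:
    """Return the seconds elapsed since just before the request at each chunk's arrival."""
    start = time.perf_counter()
    times = []
    for _chunk in make_chunks():
        times.append(time.perf_counter() - start)
    return times


# Hypothetical usage with the SDK call from the post above:
#
#   times = arrival_times(
#       lambda: client.audio.speech.create(
#           model="tts-1-hd", voice="fable", input=text
#       ).iter_bytes(chunk_size=4096)
#   )
#   print(f"first chunk after {times[0]:.2f}s, last after {times[-1]:.2f}s")
```

If the first and last times are almost identical and both are large, the whole body was already downloaded before iteration began, which is the behavior described above. A first time well below the last one would indicate real streaming.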

That would certainly explain why the speed improvement wasn’t very noticeable. Whoops.

Still running into errors attempting to stream the audio with FastAPI:

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=request.text,
)

async def generate():
    for chunk in response.iter_content(chunk_size=4096):
        yield chunk

return StreamingResponse(generate(), media_type="audio/mp3")

I investigated the source code, and I noticed that the original create() function lacked a stream parameter entirely, which would suggest that it is not designed for streaming by default:

def create(
    self,
    ...  # Other parameters
) -> HttpResponseType:
    response = self._post(
        ...  # Post request details
    )
    return response

To test streaming capabilities, I attempted to modify the function signature by adding a stream parameter and passing it through to _post (or similar lower-level function) that would accept this parameter.

def create(
    self,
    ...  # Other parameters
    stream: bool = False  # Added stream parameter
) -> HttpResponseType:
    response = self._post(
        ...  # Post request details, now including the stream parameter
        stream=stream
    )
    return response

Despite these changes and setting stream=True, I observed no change in behavior, which leads me to believe that either there is no support for streaming or additional changes are required to properly enable this feature.

I just ran the request in Postman and it does show a “Transfer-Encoding: chunked” header. So it is a stream.
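
The same check can be scripted instead of done in Postman. A minimal sketch, assuming the requests package: is_chunked() and speech_response_is_chunked() are names made up for this example, and the model and voice values are placeholders.

```python
from typing import Mapping


def is_chunked(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers declare chunked transfer encoding."""
    for name, value in headers.items():
        if name.lower() == "transfer-encoding":
            codings = [part.strip().lower() for part in value.split(",")]
            return "chunked" in codings
    return False


def speech_response_is_chunked(api_key: str, text: str = "Hello") -> bool:
    """Send one request to the speech endpoint and inspect only its headers."""
    import requests  # third-party; imported here so is_chunked() has no dependencies

    with requests.post(
        "https://api.openai.com/v1/audio/speech",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": "tts-1", "voice": "alloy", "input": text},
        stream=True,  # return once the headers are in, without downloading the body
    ) as response:
        response.raise_for_status()
        return is_chunked(response.headers)
```

A chunked header shows that the server sends the body without declaring its length up front. It does not show on its own how early the first audio bytes arrive, so it is worth combining with a timing check.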

I just came across this:

"Is it possible to stream audio?
Yes! By setting stream=True, you can chunk the returned audio file."

Has anyone had any luck?

I did, by forking and modifying the Python client lib to allow that: "stream param to Speech and AsyncSpeech" by antont, Pull Request #724 in openai/openai-python on GitHub.

Now it seems to be supported in the upstream version, though it is not released yet: "fix(client): add support for streaming binary responses" by stainless-bot, Pull Request #866 in openai/openai-python on GitHub.
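
For readers on a later SDK release, a sketch of how streamed binary responses can be consumed. It assumes an openai package version that exposes with_streaming_response on client.audio.speech, which may not match the version discussed in this thread. stream_speech() is a name made up for this example, and the client is passed in so the function can be tried with a stand-in object.

```python
from typing import Iterator


def stream_speech(
    client,
    text: str,
    model: str = "tts-1",
    voice: str = "alloy",
    chunk_size: int = 4096,
) -> Iterator[bytes]:
    """Yield audio bytes from the speech endpoint as they are received.

    `client` is expected to be an `openai.OpenAI()` instance from an SDK
    version that provides `with_streaming_response` (an assumption here).
    """
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
    ) as response:
        # The HTTP connection stays open while the caller consumes the chunks.
        yield from response.iter_bytes(chunk_size=chunk_size)
```

Because this is a plain generator, it could be handed to something like a FastAPI StreamingResponse, or iterated directly and written to an audio device.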

The changes to the python library seem to just allow passing, not parsing.

For example, want to see if logprobs are supported in chat completions? There are alterations you need to make all over the library to allow them out and back in, because it enforces the API schema.

The unanswered question is whether streaming is actually done logically, that is, on the frames of the underlying audio format, for formats that use independent packets. For example: flac is not streamable any more than a zip file, and mp3 uses a cross-frame buffer. Compare that to Opus, which has neural packet loss concealment because it has a foundation in streaming.
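
One way to look into that question for the opus format is to check whether received bytes fall on container page boundaries. The sketch below assumes the opus response is Opus inside an Ogg container, which is not verified here. It relies only on the Ogg page layout: a 27-byte header that starts with "OggS", with the segment count in its last byte, followed by a segment table whose entries add up to the payload length. split_ogg_pages() is a name made up for this example.

```python
from typing import List, Tuple

OGG_HEADER_LEN = 27  # fixed part of an Ogg page header


def split_ogg_pages(data: bytes) -> Tuple[List[bytes], bytes]:
    """Split `data` into complete Ogg pages plus any incomplete leftover bytes."""
    pages = []
    pos = 0
    while len(data) - pos >= OGG_HEADER_LEN:
        if data[pos:pos + 4] != b"OggS":
            raise ValueError(f"no Ogg page starts at offset {pos}")
        segment_count = data[pos + 26]
        header_len = OGG_HEADER_LEN + segment_count
        if len(data) - pos < header_len:
            break  # the segment table has not fully arrived yet
        payload_len = sum(data[pos + OGG_HEADER_LEN:pos + header_len])
        end = pos + header_len + payload_len
        if len(data) < end:
            break  # the payload has not fully arrived yet
        pages.append(data[pos:end])
        pos = end
    return pages, data[pos:]
```

Feeding it the bytes received so far (the previous leftover plus the new chunk) shows how many whole pages are available at each point. If chunks regularly end mid-page, the chunking is a transport detail rather than a logical framing of the audio.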

Hey everyone!

So, I think I have streaming running on my Django server (using django-ninja). Here is the code I’m using. Note that it logs the chunks properly, which is why I think I have streaming running. Also, you have to use the API and requests library. This doesn’t work with the Python SDK.

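import requests
from django.conf import settings
from django.http import StreamingHttpResponse

# `router` (a django-ninja Router) and the `AudioRequest` schema are defined elsewhere
@router.post("/stream_audio/")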
def stream_audio(request, payload: AudioRequest):
    # OpenAI API endpoint and parameters
    url = "https://api.openai.com/v1/audio/speech"
    headers = {
        "Authorization": f'Bearer {settings.OPENAI_KEY}',
    }
    data = {
        "model": payload.model,
        "input": payload.input_text,
        "voice": payload.voice,
        "response_format": "opus",
    }

    response = requests.post(url, headers=headers, json=data, stream=True)
    if response.status_code == 200:
        # Printing each chunk size
        def generate():
            for chunk in response.iter_content(chunk_size=1024):
                print(f"Chunk size: {len(chunk)}")  # Print the size of each chunk
                yield chunk

        return StreamingHttpResponse(
            streaming_content=generate(),
            content_type="audio/opus"
        )
    else:
        return {"error": f"Error: {response.status_code} - {response.text}"}

I am running into a problem on the frontend, however. When my React/Vite frontend makes a request to this API, I see all the chunks logged in the console, but the app waits for the last chunk before it will play. I think there’s buffering going on in the frontend.

If you can think of anything, let me know!

This is my solution:
github: kvsur/openai-speech-stream-player

I believe you should use the response_format="opus" parameter. I don’t understand much about this, but I have a hunch that this might be the way to go.

If you manage to do it, please don’t forget to let us know!

This code works. Thank you @tonycamonte

I have had some level of success with this in the browser. Here is my code, courtesy of some help from GPT-4:

const audio = document.getElementById('audio');
const mediaSource = new MediaSource();
audio.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', sourceOpen);

async function sourceOpen() {
    const sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg'); // Adjust MIME type as needed
    const rs = await fetch('https://api.openai.com/v1/audio/speech', {
        method: 'POST',
        headers: {
            Authorization: 'Bearer ' + openAIApiKey,
            'Content-Type': 'application/json',
        },
        body: JSON.stringify({
            input: 'What is up?!',
            model: 'tts-1',
            response_format: 'mp3',
            voice: 'echo',
        }),
    }).then((res) => res.body);

    const reader = rs.getReader();

    reader.read().then(function process({ done, value }) {
        if (done) {
            if (mediaSource.readyState === 'open') mediaSource.endOfStream();
            return;
        }
        // If value is not in the right format, you need to transcode it here
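        // Register the listener with { once: true } before appending, so that a
        // new listener does not pile up for every chunk and trigger duplicate reads
        sourceBuffer.addEventListener('updateend', () => {
            if (!sourceBuffer.updating && mediaSource.readyState === 'open') {
                reader.read().then(process);
            }
        }, { once: true });
        sourceBuffer.appendBuffer(value);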
    });
}

The audio starts playing in the Audio HTML Element as soon as the first data chunks are returned from the API.

Have not gotten it to work yet with OPUS media type, and there are still several errors I have yet to work out.

I got the output of OAI TTS to stream. Here’s an example:

import io

import requests

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": 'Bearer YOUR_API_KEY', 
}

data = {
    "model": model,
    "input": input_text,
    "voice": voice,
    "response_format": "opus",
}

with requests.post(url, headers=headers, json=data, stream=True) as response:
    if response.status_code == 200:
        buffer = io.BytesIO()
        for chunk in response.iter_content(chunk_size=4096):
            buffer.write(chunk)

I am having mixed results depending on the format. Notably, opus is not streamable in any of the browsers I have on hand, e.g. Chrome, Safari, Firefox.

Did any of you manage to get opus to play in the browser?

If so, care to share the snippets handling this format?