The Speech API supports real-time audio streaming using chunked transfer encoding. This means the audio can be played back before the full file has been generated and made accessible.
I have been trying for hours to play back chunks of the openai.audio.speech.create() response through a sounddevice output stream. I am attempting something along the lines of:
import io

import openai
import sounddevice as sd
import soundfile as sf

response = openai.audio.speech.create(input="...")  # shortened for brevity
samplerate = 24000.0  # Got this from `data, fs = sf.read([the whole file])`
channels = 2
blocksize = 1024
stream = sd.OutputStream(
    device=1,
    samplerate=samplerate,
    channels=channels,
    dtype="float32",
    prime_output_buffers_using_stream_callback=False,  # Found in the sd.play() method, which does play the sound nicely once entirely written to file
)
read_size = blocksize * channels * stream.samplesize
with stream:
    # This is what stream_to_file() would be doing
    for chunk in response.iter_bytes(chunk_size=read_size):
        data, samplerate = sf.read(io.BytesIO(chunk), dtype="float32")
        stream.write(data)
Depending on the response_format I choose, be it mp3, opus, or flac, I get different errors, either from sf.read() or from stream.write().
I am a total noob when it comes to audio formats and handling. Am I on the right track at all?
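For what it's worth, the core problem with the snippet above is that each HTTP chunk is an arbitrary byte slice, not a self-contained audio file, so sf.read() has no header or frame boundary to work with. Below is a minimal sketch of one way to sidestep the container issue entirely; it assumes your SDK version exposes with_streaming_response and that the endpoint supports response_format="pcm" (raw 16-bit little-endian PCM, which I believe is 24 kHz mono). Raw PCM needs no decoding, so arbitrary chunks can be written straight to the output stream:

import numpy as np
import sounddevice as sd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

samplerate = 24000  # assumed PCM sample rate
channels = 1        # assumed mono PCM output

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello, this should start playing before generation finishes.",
    response_format="pcm",
) as response:
    with sd.OutputStream(samplerate=samplerate, channels=channels, dtype="int16") as stream:
        leftover = b""
        for chunk in response.iter_bytes(chunk_size=4096):
            data = leftover + chunk
            usable = len(data) - len(data) % 2  # int16 frames are 2 bytes each
            leftover = data[usable:]            # carry any odd trailing byte forward
            frames = np.frombuffer(data[:usable], dtype=np.int16)
            if frames.size:
                stream.write(frames)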
Hey @cesarbaudi1 @TomTom101, just posted a thread on X that might help. Basically, make a call to the speech endpoint and use pyaudio to chunk and play as it streams.
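In case it helps, here is a rough sketch of that approach: hit the endpoint directly with requests and feed pyaudio as bytes arrive. It assumes raw PCM output (response_format="pcm", 24 kHz 16-bit mono), since a compressed format would need a decoder in between:

import pyaudio
import requests

url = "https://api.openai.com/v1/audio/speech"
headers = {"Authorization": "Bearer YOUR_API_KEY"}  # placeholder
payload = {
    "model": "tts-1",
    "input": "Streaming straight into pyaudio.",
    "voice": "alloy",
    "response_format": "pcm",  # raw PCM, so no decoding step is needed
}

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

with requests.post(url, headers=headers, json=payload, stream=True) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=4096):
        stream.write(chunk)  # pyaudio accepts raw bytes; 4096 is a whole number of int16 frames

stream.stop_stream()
stream.close()
p.terminate()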
@tonycamonte are you actually getting streaming audio, though?
From what I can tell, {spoken_response} generates the full audio output all in one go. I have a similar script, and it'll play, sure, but put a print statement before buffer() and give it a good chunk of text, and you'll see that there's 30 seconds of processing before it even tries to assign a value to the buffer.
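For anyone who wants to check this themselves, time-to-first-chunk is the number that matters. A quick probe (hypothetical key and input; the timing print is the point):

import time

import requests

t0 = time.monotonic()
with requests.post(
    "https://api.openai.com/v1/audio/speech",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder
    json={"model": "tts-1", "input": "A good long paragraph of text...",
          "voice": "alloy", "response_format": "opus"},
    stream=True,
) as r:
    for i, chunk in enumerate(r.iter_content(chunk_size=1024)):
        # If the first line prints long before the last, bytes really do
        # arrive incrementally; if everything prints at once at the end,
        # the response was buffered somewhere.
        print(f"chunk {i}: {len(chunk)} bytes at {time.monotonic() - t0:.2f}s")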
I investigated the source code, and I noticed that the original create() function lacked a stream parameter entirely, which would suggest that it is not designed for streaming by default:
To test streaming capabilities, I attempted to modify the function signature by adding a stream parameter and passing it through to _post (or similar lower-level function) that would accept this parameter.
def create(
    self,
    ...  # other parameters
    stream: bool = False,  # added stream parameter
) -> HttpResponseType:
    response = self._post(
        ...  # post request details, now including the stream parameter
        stream=stream,
    )
    return response
Despite these changes and setting stream=True, I observed no change in behavior, leading me to believe that either there is no support for streaming or additional changes are required to properly enable this feature.
The changes to the Python library seem to just allow passing, not parsing.
For example, want to see if logprobs are supported in chat completions? There are alterations you need to make all over the library to allow them out and back in, because it enforces the API schema.
The unanswered question is whether streaming is actually done logically, i.e. on frames of the underlying audio format that use independent packets. For example: flac is no more streamable than a zip file; mp3 uses a cross-frame buffer. Compare to Opus, which has neural packet loss concealment because it has a foundation in streaming.
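To make the Opus point concrete: Ogg-encapsulated Opus arrives as self-delimiting pages, each beginning with the four-byte capture pattern OggS, so a receiver can locate packet boundaries in a raw byte stream without decoding anything. A toy scanner (illustration only):

def ogg_page_offsets(data: bytes):
    """Yield byte offsets of Ogg page boundaries (the b"OggS" capture pattern)."""
    offset = data.find(b"OggS")
    while offset != -1:
        yield offset
        offset = data.find(b"OggS", offset + 4)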
So, I think I have streaming running on my Django server (using django-ninja). Here is the code I'm using. Note that it logs the chunks properly, which is why I think I have streaming running. Also, you have to call the API directly with the requests library; this doesn't work with the Python SDK.
import requests
from django.conf import settings
from django.http import StreamingHttpResponse

@router.post("/stream_audio/")
def stream_audio(request, payload: AudioRequest):
    # OpenAI API endpoint and parameters
    url = "https://api.openai.com/v1/audio/speech"
    headers = {
        "Authorization": f"Bearer {settings.OPENAI_KEY}",
    }
    data = {
        "model": payload.model,
        "input": payload.input_text,
        "voice": payload.voice,
        "response_format": "opus",
    }
    response = requests.post(url, headers=headers, json=data, stream=True)
    if response.status_code == 200:
        # Printing each chunk size
        def generate():
            for chunk in response.iter_content(chunk_size=1024):
                print(f"Chunk size: {len(chunk)}")  # Print the size of each chunk
                yield chunk

        return StreamingHttpResponse(
            streaming_content=generate(),
            content_type="audio/opus",
        )
    else:
        return {"error": f"Error: {response.status_code} - {response.text}"}
I am running into a problem on the frontend, however. When my React/Vite frontend makes a request to this API, I see all the chunks logged in the console, but the app waits for the last chunk before it will play anything. I think there's buffering going on in the frontend.
I believe you should use the response_format="opus" parameter. I don’t understand much about this, but I have a hunch that this might be the way to go.
If you manage to do it, please don’t forget to let us know!
I am having mixed results depending on the format. Notably, opus is not streamable in any of the browsers I have on hand, e.g. Chrome, Safari, Firefox.
Did any of you manage to get opus to play in a browser?
If so, care to share the snippets handling this format?