GPT-4 audio preview with streaming audio output

Hello,

I’m currently working with the gpt-4o-audio-preview model to replace my existing system, which first calls the completions API for text and then a text-to-speech (TTS) API for the audio. To improve response times, I’ve switched to streaming responses and now read the chunks as they arrive. However, I need help retrieving the audio bytes and the transcript from these chunks, as I couldn’t find any examples or documentation on handling them with the Python library.
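
For reference, audio_params in the snippet is my audio configuration dict; the voice and format values here are just my current choices:

audio_params = {"voice": "alloy", "format": "pcm16"}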

Below is a snippet of my code for context:

# Using OpenAI chat completions with audio modality
for chunk in client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=prompt,
    modalities=["text", "audio"],
    audio=audio_params,
    temperature=0.7,
    max_completion_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stream=True):
    
    # Extracting text and audio chunks
    text_chunk = chunk.choices[0].delta.get('audio', {}).get('transcript', None)
    audio_chunk = chunk.choices[0].delta.get('audio', {}).get('data', None)
    
    if text_chunk:
        yield "text", text_chunk
    if audio_chunk:
        yield "audio", audio_chunk

When I print the chunks, they appear in this format:

delta=ChoiceDelta(content=None, function_call=None, refusal=None, role='assistant', tool_calls=None, audio={'id': 'audio_678a9b2278288190b874aa41276aa1fc', 'data': '/v/8//r//v/6//z/+//9//7//f8AAPz//f/6//3/AQAAAAAA//8CAPz//...'}, finish_reason=None, index=0, logprobs=None)

Some chunks include audio data, while others contain parts of the transcript. The code snippet below failed to work, and I received an error indicating that ‘audio’ does not exist under the delta object:

chunk.choices[0].delta.audio

Could anyone provide guidance on correctly parsing text and audio chunks from the streaming response? Any help or examples would be greatly appreciated!

Thank you!

You will need to gather every type of field that can appear in a response, in parallel, accumulating each delta chunk into a data collector.

For example, my library has a simple class for the total response object:

from typing import Any

class ResponseState:
    """Holds the state of the streaming response."""
    def __init__(self) -> None:
        self.content: str = ''
        self.function_call: dict[str, str] = {}
        self.tool_calls: dict[int, dict[str, Any]] = {}
        self.finish_reason: str | None = None
        self.audio_transcript: str | None = None
        self.audio_id: str | None = None
        self.audio_data: str | None = None
        self.usage: dict[str, Any] | None = None

You can fill it up with deltas.

Then, in my usage, deep within an iterative tool-response parser that runs after an SSE chunk parser (I’m not using the openai library), each type of response object is collected:

    choice = choices[0]
    delta = choice.get('delta', {})
    if 'content' in delta and delta['content'] is not None:
        content_piece = delta['content']
        state.content += content_piece
    ...
    if 'function_call' in delta and delta['function_call']:
        for key, value in delta['function_call'].items():
            # Accumulate the function_call parts as they may be streamed in chunks
            state.function_call[key] = state.function_call.get(key, '') + value
    ...

There’s also an “emit” for content, not pictured here, since content can be displayed “live” as it arrives.

Since the AI could produce content and a tool call, audio data plus a transcript and an ID you must re-use, usage statistics, and a refusal reason (or whatever else), you must not just look for one type of delta.
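
For illustration, here is the audio branch in the same style. This is only a sketch: I’m assuming the delta carries an 'audio' dict with optional 'id', 'transcript', and 'data' keys, matching the chunk printed above.

    audio = delta.get('audio')
    if audio:
        if 'id' in audio:
            state.audio_id = audio['id']  # re-use this ID to reference the audio in a later turn
        if 'transcript' in audio:
            state.audio_transcript = (state.audio_transcript or '') + audio['transcript']
        if 'data' in audio:
            # fragments of one base64 string; concatenate now, decode when the stream ends
            state.audio_data = (state.audio_data or '') + audio['data']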


The chat completions audio is a base64-encoded file in the format you specify. You would typically collect the entire file before decoding it.
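
For example, a sketch of writing the collected result to disk, assuming pcm16 output (which streams as headerless 24 kHz mono 16-bit audio), wrapped in a WAV header so any player can open it:

import base64
import wave

# state.audio_data holds the fully collected base64 text from all 'data' deltas
pcm_bytes = base64.b64decode(state.audio_data)
with wave.open('reply.wav', 'wb') as f:
    f.setnchannels(1)       # mono
    f.setsampwidth(2)       # 16-bit samples
    f.setframerate(24000)   # pcm16 sample rate
    f.writeframes(pcm_bytes)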

Since base64 is a 3-bytes-to-4-characters encoding, and pcm16 arrives without a header, it may be possible to collect into a buffer, run your own stream decoder, and start playing before the response finishes, but you could also end up playing noise if the API or your unwrapper goes wrong, or hit a buffer underrun.
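
If you want to try that, here is a minimal sketch of such a stream decoder (the class name is mine): it only decodes whole 4-character base64 groups and buffers the remainder, so every returned bytes object is valid pcm16 you can append to a playback queue.

import base64

class Base64StreamDecoder:
    """Incrementally decodes base64 text arriving in arbitrary-length fragments."""
    def __init__(self) -> None:
        self.pending = ''  # carries incomplete 4-character groups to the next call

    def feed(self, fragment: str) -> bytes:
        self.pending += fragment
        usable = len(self.pending) - (len(self.pending) % 4)  # 4 chars decode to 3 bytes
        decodable, self.pending = self.pending[:usable], self.pending[usable:]
        return base64.b64decode(decodable) if decodable else b''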

Another possibility is to re-stream a format like MP3 after unwrapping the base64, when you have a client that can do the buffering, since MP3 is packetized with a sync signature at each frame.
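
As a sketch of that idea, assuming a Flask server and a generate_response() generator like the ones in this thread (both names are placeholders):

from flask import Flask, Response

app = Flask(__name__)

@app.route('/speech')
def speech():
    def mp3_stream():
        # yield already-decoded MP3 bytes as they come off the API stream
        for kind, payload in generate_response():
            if kind == "audio":
                yield payload
    # a client pointed at this route (e.g. a browser audio element) does the buffering
    return Response(mp3_stream(), mimetype='audio/mpeg')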


You can look at helpers.md in the openai Python SDK to see how OpenAI wrote some beta stream collectors, but it has no audio methods.


I finally came up with a working solution for my needs:

# OpenAI chat completions with audio modality (requires: import base64)
        for chunk in client.chat.completions.create(
                model="gpt-4o-audio-preview",
                messages=self.conv_dict[bot_name],
                modalities=["text", "audio"],
                audio=audio_params,
                temperature=0.7,
                max_completion_tokens=1000,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                stream=True):
            delta = chunk.choices[0].delta
            # audio deltas arrive as a plain dict on the delta object, so test with hasattr
            if hasattr(delta, 'audio'):
                if 'transcript' in delta.audio:
                    text_chunk = delta.audio['transcript']
                    print(text_chunk)
                    yield "text", text_chunk
                if 'data' in delta.audio:
                    audio_chunk = delta.audio['data']
                    yield "audio", base64.b64decode(audio_chunk)

Now I get chunks of text from the transcript and chunks of the audio, and I can start playing the audio while the stream is still sending data. I’m now looking for the best way to play the audio while the stream is still loading data.
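
One option I’m experimenting with is to feed the decoded chunks into a blocking raw output stream, e.g. with the sounddevice library. This is just a sketch, assuming pcm16 output (24 kHz mono 16-bit little-endian), and the method name wrapping my generator above is a placeholder:

import sounddevice as sd  # pip install sounddevice

stream = sd.RawOutputStream(samplerate=24000, channels=1, dtype='int16')
stream.start()
for kind, payload in self.generate_answer(bot_name):  # the generator shown above
    if kind == "audio":
        stream.write(payload)  # blocks until the chunk is queued, which paces playback naturally
stream.stop()
stream.close()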
