Hello,
I’m currently working with the gpt-4o-audio-preview
model to replace my existing system, which makes two separate calls: one to the chat completions API for text, then one to a text-to-speech (TTS) API for audio. To improve response times, I’ve switched to streaming responses and now read the chunks as they arrive. However, I need help extracting the audio bytes and the transcript from these chunks, as I couldn’t find any examples or documentation on handling them with the Python library.
Below is a snippet of my code for context:
# Using OpenAI chat completions with the audio modality
for chunk in client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=prompt,
    modalities=["text", "audio"],
    audio=audio_params,
    temperature=0.7,
    max_completion_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stream=True,
):
    # Trying to extract transcript text and audio data from each chunk
    text_chunk = chunk.choices[0].delta.get('audio', {}).get('transcript', None)
    audio_chunk = chunk.choices[0].delta.get('audio', {}).get('data', None)
    if text_chunk:
        yield "text", text_chunk
    if audio_chunk:
        yield "audio", audio_chunk
When I print the chunks, they appear in this format:
Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role='assistant', tool_calls=None, audio={'id': 'audio_678a9b2278288190b874aa41276aa1fc', 'data': '/v/8//r//v/6//z/+//9//7//f8AAPz//f/6//3/AQAAAAAA//8CAPz//...'}), finish_reason=None, index=0, logprobs=None)
Some chunks include audio data, while others contain parts of the transcript. The `.get(...)` calls above fail because the delta is not a dict, and when I tried attribute access instead, I got an error saying that ‘audio’ does not exist on the delta object:
chunk.choices[0].delta.audio
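For reference, here is the direction I’m experimenting with, though I’m not sure it’s right: my guess is that the delta is a Pydantic model on which `audio` is sometimes absent, so `getattr` with a default would avoid the error, and the `audio` value itself looks like a plain dict. This is only a sketch against a stand-in object shaped like the printed chunks above, not confirmed against the real API:

```python
from types import SimpleNamespace

def extract_audio_parts(delta):
    # `audio` seems to be missing on some chunks, so fall back to an
    # empty dict; when present it appears to be a plain dict.
    audio = getattr(delta, "audio", None) or {}
    return audio.get("transcript"), audio.get("data")

# Stand-in deltas mimicking the printed chunks (not real SDK objects)
audio_delta = SimpleNamespace(audio={"id": "audio_123", "data": "/v/8//r/"})
text_delta = SimpleNamespace(audio={"transcript": "Hello"})
empty_delta = SimpleNamespace(content=None)

print(extract_audio_parts(audio_delta))  # (None, '/v/8//r/')
print(extract_audio_parts(text_delta))   # ('Hello', None)
print(extract_audio_parts(empty_delta))  # (None, None)
```

If the real delta doesn’t behave like this, I’d appreciate knowing what the correct access pattern is.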
Could anyone provide guidance on correctly parsing text and audio chunks from the streaming response? Any help or examples would be greatly appreciated!
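In case it matters for the answer: once I do get the `data` field out, my plan is to base64-decode it into raw audio bytes. The printed value (`'/v/8//r/...'`) looks base64-encoded, but I’m assuming that; here is a minimal sketch using a shortened sample from the chunk above:

```python
import base64

# Shortened sample of the 'data' field from the printed chunk;
# assuming it's base64-encoded audio, decode it to raw bytes.
b64_data = "/v/8//r/"
raw_bytes = base64.b64decode(b64_data)
print(len(raw_bytes))  # 6 raw bytes from 8 base64 characters
```

If the streamed `data` is something other than base64 (or needs chunk-boundary handling before decoding), that would be good to know too.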
Thank you!