I can confirm that the chat completions endpoint seems to support streaming the text and audio modalities at the same time.
When streaming, the only supported audio format is pcm16.
Example request:
{
  "model": "gpt-4o-audio-preview",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "modalities": ["text", "audio"],
  "audio": {
    "voice": "alloy",
    "format": "pcm16"
  },
  "stream_options": {
    "include_usage": true
  },
  "stream": true
}
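For anyone who wants to try this from a script, here is a minimal sketch of how the request body above could be wrapped in an HTTP request using only the Python standard library. The function name build_request is hypothetical, and I am assuming the usual https://api.openai.com/v1/chat/completions URL with a Bearer token. The sketch only builds the request object and does not send it.

```python
import json
import urllib.request

# Assumption: the standard chat completions URL.
API_URL = "https://api.openai.com/v1/chat/completions"


def build_request(api_key):
    """Build (but do not send) the streaming text+audio request from the post.

    build_request is a hypothetical helper name, not part of any SDK.
    """
    body = {
        "model": "gpt-4o-audio-preview",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        "modalities": ["text", "audio"],
        # pcm16 is the format reported to work with streaming in the post.
        "audio": {"voice": "alloy", "format": "pcm16"},
        "stream_options": {"include_usage": True},
        "stream": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would be a matter of passing the result to urllib.request.urlopen and iterating over the response line by line. Each line arrives as bytes and needs decoding before it can be parsed.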
Response:
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"role":"assistant","refusal":null},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"content":null,"audio":{"id":"audio_67153952b364819093d6a4aac6e0767a","transcript":"Hi"}},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" there"}},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":"!"}},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" How"}},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" can"}},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" I"}},"finish_reason":null}],"usage":null}
...
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"role":"assistant","content":null,"refusal":null,"audio":{"id":"audio_67153952b364819093d6a4aac6e0767a","data":"CgAEAAEACQABAAcABQAGAAgABwAKAAIAAgACAAQABwAFAAQAA..."}},"finish_reason":null}],"usage":null}
...
data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"data":"OCPRIZkeWRsBGB8V..."}},"finish_reason":null}],"usage":null}
...
data: [DONE]
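It looks like the stream interleaves two kinds of chunks: some carry a piece of the transcript in delta.audio.transcript, and others carry a piece of the base64-encoded audio in delta.audio.data. The regular text field, delta.content, is null throughout, so the text of the reply is only available through the transcript.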
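A rough sketch of how a client might consume a stream shaped like the one above: it walks the "data:" lines, joins the transcript pieces, and concatenates the decoded audio bytes. The function names are hypothetical. I am assuming each data chunk is independently valid base64, and that the PCM is 16-bit mono at 24 kHz; the sample rate in particular is an assumption and is not shown anywhere in the response.

```python
import base64
import io
import json
import wave


def collect_audio_stream(lines):
    """Return (transcript, pcm_bytes) from an iterable of SSE text lines.

    Hypothetical helper; assumes each audio "data" value decodes on its own.
    """
    transcript_parts = []
    pcm = bytearray()
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # A usage-only chunk may have an empty choices list, so this loop
        # simply does nothing for it.
        for choice in chunk.get("choices", []):
            audio = choice.get("delta", {}).get("audio") or {}
            if audio.get("transcript"):
                transcript_parts.append(audio["transcript"])
            if audio.get("data"):
                pcm.extend(base64.b64decode(audio["data"]))
    return "".join(transcript_parts), bytes(pcm)


def pcm16_to_wav(pcm, sample_rate=24000):
    """Wrap raw 16-bit mono PCM in a WAV container.

    The 24000 Hz default is an assumption, not taken from the response.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono (assumed)
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

The WAV step is only there so the result can be played in an ordinary audio player. A real-time client would more likely feed the decoded bytes to an audio device as they arrive.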