The structure of realtime api. Why output audio_transcriptin rather than text?

Hi, all.
I am curious as to why “response.text.delta” is not employed to convey the audio transcription when the output modalities include both “text” and “audio”. This approach appears to align better with the input modalities. At present, the transcription is being output as “response.audio_transcription.delta”. Is there a specific rationale for this choice?

1 Like