Quick question: I know the realtime models support text and audio as modalities. Is it possible to give the model audio as input and, as output, get both the audio and the corresponding text transcript (without requiring an additional STT step)? Or is only one modality available for output?
You can specify a transcription model that runs in parallel to the realtime audio model. When you do, you will get audio responses as usual, plus text transcript content.
Handle these message types:

- `response.output_audio.delta` → audio chunks from the model for your user to hear
- `conversation.item.input_audio_transcription.completed` → text transcription of the audio your user has spoken to the model
- `response.output_audio_transcript.done` → text transcripts of audio the model has spoken to your user.
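A minimal sketch of dispatching those three event types. This assumes the usual payload fields (`delta` carrying a base64 audio chunk, `transcript` carrying text); the `out` dict is a stand-in for whatever playback and UI calls your app actually makes.

```python
import base64
import json


def handle_event(raw: str, out: dict) -> None:
    """Dispatch one realtime server event by type (event names from above).

    `out` just collects results here; replace the branches with real
    audio-playback / UI calls in your app.
    """
    event = json.loads(raw)
    etype = event.get("type")

    if etype == "response.output_audio.delta":
        # base64-encoded audio chunk for the user to hear
        out["audio"] = out.get("audio", b"") + base64.b64decode(event["delta"])
    elif etype == "conversation.item.input_audio_transcription.completed":
        # text transcript of what the user said
        out["user_transcript"] = event["transcript"]
    elif etype == "response.output_audio_transcript.done":
        # text transcript of what the model said
        out["model_transcript"] = event["transcript"]
```

In a real client this function would be the `on_message` callback of your websocket connection.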
Set up the session like this (important keys marked with `#<<<<` so you can ignore keys that are specific to my use case):
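A minimal sketch of such a `session.update` payload, built as a Python dict and serialized to JSON before sending over the websocket. The key names assume the GA Realtime API session shape, and the transcription model and voice below are illustrative choices, not requirements.

```python
import json

# Sketch of a session.update that enables input-audio transcription
# alongside normal audio output. Model and voice names are illustrative.
session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "output_modalities": ["audio"],  #<<<< audio out; transcript events still arrive as text
        "audio": {
            "input": {
                "transcription": {  #<<<< enables conversation.item.input_audio_transcription.* events
                    "model": "gpt-4o-mini-transcribe",
                },
                "turn_detection": {"type": "server_vad"},
            },
            "output": {"voice": "marin"},
        },
    },
}

# Sent as a JSON text frame over the realtime websocket:
payload = json.dumps(session_update)
```

Once the `transcription` block is present, the input-transcription events listed above start arriving alongside the audio deltas.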
Thank you very much @mcfinley. That's the approach I am currently taking, with a second model running in parallel. I was just wondering whether what I am doing is inefficient and whether the realtime model could output two modalities at the same time.