Realtime API message response - Audio + Text

Hi!

Does the Realtime API support responses in both Audio and Text? If so, how can it be implemented? How do I ask the model to split a response between Audio and Text?

As an example, if the model message was:
"Several public licenses allow open-source distribution of software while imposing restrictions on its use. Here are some common ones that provide varying levels of control over how the software can be used, modified, and redistributed:

GNU General Public License (GPL)
Use Case: Ensures that software and its derivatives remain open-source.
Restriction: Any derivative work must also be distributed under the same license, meaning if someone modifies your software, they must release their modifications as open-source.
GNU Affero General Public License (AGPL)
Use Case: Specifically designed for networked software (e.g., web apps).
Restriction: Requires that any changes made to the software, even if it’s just used over a network (like in a cloud service), must also be shared as open-source. It prevents proprietary forks used in hosted environments without releasing source code."

This part of the message should be Audio (+ text):
“Several public licenses allow open-source distribution of software while imposing restrictions on its use. Here are some common ones that provide varying levels of control over how the software can be used, modified, and redistributed.”

The remaining part of the message (details) should be Text only. How can this be implemented? Can the model respond with two different message types “audio” and “text”?

Thanks for the help!!

I don’t see an easy way at the moment. You can choose modalities [“text”] and get text messages, or you can choose modalities [“audio”, “text”] and get audio plus a text transcript of the audio.
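
For reference, the two configurations could be sketched as `session.update` events. This is a minimal sketch assuming the beta event shape (a `session` object carrying a `modalities` list); the helper name `build_session_update` is made up for illustration, and actually sending the JSON over the WebSocket is left out.

```python
import json


def build_session_update(modalities):
    """Return a session.update event (as a dict) for the given modalities."""
    # Assumption: the beta API reads the list from session.modalities.
    return {"type": "session.update", "session": {"modalities": list(modalities)}}


# Text only: the model replies with text messages.
text_only = build_session_update(["text"])

# Audio plus text: the model replies with audio and a transcript of that audio.
audio_and_text = build_session_update(["audio", "text"])

# Each event would be serialized like this before being sent on the socket.
payload = json.dumps(audio_and_text)
```

Neither setting lets a single response carry a spoken part and a separate text-only part, which is the gap the question is about.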

You could probably do some modality switching and/or use functions with specific instructions to get the result you want. However, the API is in beta and has a lot of issues with even simpler tasks.
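
Here is roughly what I have in mind, as an untested sketch. It assumes the beta tool format (flat `type`/`name`/`parameters` entries under `session.tools`), that a finished function call shows up as an item with `type: "function_call"` and a JSON string in `arguments`, and that `response.create` accepts a per-response `modalities` list. The tool name `show_details`, the instruction text, and all three helpers are hypothetical, and the model may not follow the split reliably.

```python
import json

# Hypothetical tool: the client shows `text` on screen instead of playing it.
SHOW_DETAILS_TOOL = {
    "type": "function",
    "name": "show_details",
    "description": "Display detailed text to the user without speaking it.",
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}


def build_split_session():
    """Function approach: speak the summary, pass the details to a tool."""
    return {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "instructions": (
                "Speak only a short summary. Put lists and long details "
                "into a show_details call instead of saying them aloud."
            ),
            "tools": [SHOW_DETAILS_TOOL],
            "tool_choice": "auto",
        },
    }


def build_text_only_response(instructions):
    """Modality switching: request one follow-up response as text only."""
    return {
        "type": "response.create",
        "response": {"modalities": ["text"], "instructions": instructions},
    }


def extract_details(item):
    """Return the display text if `item` is a show_details call, else None."""
    if item.get("type") != "function_call" or item.get("name") != "show_details":
        return None
    try:
        args = json.loads(item.get("arguments"))
    except (json.JSONDecodeError, TypeError):
        return None  # malformed or missing arguments
    return args.get("text") if isinstance(args, dict) else None
```

With the function approach the client plays the audio as usual and renders whatever `extract_details` returns; with modality switching the client sends a second, text-only request after the spoken summary finishes.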

You will probably have to wait until some of the existing issues are fixed, and then you can re-evaluate your use case and possible solutions.

Yes, I’ve racked my brain over it, tried adding some tags to the text transcripts, but it doesn’t work well. Plus, it’s expensive since you’re paying in real time and just getting text.

But we definitely need this in the API. In a normal conversation, you’d just say, “Yes, here are your options,” and show the details as text.

PS: Trust me, I’ve noticed the issues. I spent a whole night fixing my code, only to find out the next day it was the API not responding properly… But still, I’m pretty impressed with the possibilities!