Realtime API Audio Modality output

Hello, I’m using the audio and text modalities with the Realtime API. My original instructions returned JSON output containing a ‘text’ field, which needed to be converted to audio, along with other fields.

With the Realtime API, I’m trying to get the audio_transcript to keep returning the JSON output while the audio plays only the value of the ‘text’ field. It works some of the time but not consistently, even when I specify this in the instructions.

Any pointers on how to achieve this?

As far as I understand, you are trying to do realtime STT and TTS, which is not what this model is made for. It can take either text or audio as input, but it can only respond to that input, not transcribe it. The same applies to TTS in reverse, so you can’t just tell it to convert some text into audio.

Let me clarify: I’m building a conversational agent. I pass in a text prompt that asks the model to do some evaluation and return a JSON object with two fields: reply (a text field), which is the agent’s reply to the prompt, and complete (a bool field), which indicates whether the conversation is finished.
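For reference, the output shape described above can be sketched in Python as follows. The field names `reply` and `complete` come from the post; the `AgentTurn` class and the `parse_agent_turn` helper are hypothetical names introduced only for illustration, and the sample JSON string is made up.

```python
import json
from dataclasses import dataclass


@dataclass
class AgentTurn:
    reply: str      # the agent's reply to the prompt
    complete: bool  # True once the conversation is finished


def parse_agent_turn(raw: str) -> AgentTurn:
    """Parse the model's JSON output and check the type of both fields."""
    data = json.loads(raw)
    reply = data["reply"]
    complete = data["complete"]
    if not isinstance(reply, str) or not isinstance(complete, bool):
        raise ValueError("expected a string 'reply' and a boolean 'complete'")
    return AgentTurn(reply=reply, complete=complete)


# Illustrative input only; a real transcript would come from the API.
turn = parse_agent_turn('{"reply": "Thanks, that is all I needed.", "complete": true}')
print(turn.reply)
print(turn.complete)
```

With a parser like this, only `turn.reply` is the part that should be spoken, while `turn.complete` is consumed by the application.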

With the Realtime API’s text-to-audio output, I’m trying to understand how to accomplish this, since the audio transcript contains a JSON object with two fields but I want the audio to contain only the content of the reply (text) field.

You can’t get the model to turn your text reply into audio because it can only “voice out” its own response. This is not a realtime text-to-speech API.

I get why it doesn’t work out of the box, and that’s why I started this thread: to look for any workarounds people may have found to handle this while still getting the benefits of the Realtime API. A lot of people who were using the Completions API with JSON responses will likely move to Realtime and may face a similar issue.

I really don’t know what to suggest then. I obviously don’t fully understand the case you are trying to solve, but I wonder why you need the model to specifically turn the response text into audio. You also mentioned function calling, if I understood correctly, but its intended purpose in the Realtime API is definitely not to work around the model architecture and make it possible to turn text into audio.
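One direction that may be worth trying, offered as an untested sketch and not a confirmed solution: instead of asking for JSON in the transcript, let the model speak its reply as plain speech and report the `complete` flag through a function call. The code below only builds and reads plain dictionaries and does not open a connection. The `session.update` event, the `tools` layout, and the `response.function_call_arguments.done` event follow my understanding of the Realtime API reference and should be checked against the current docs. The function name `report_conversation_status`, the instruction text, and both helper names are hypothetical.

```python
import json
from typing import Optional


def build_session_update() -> dict:
    """Build a session.update payload that registers one function tool."""
    return {
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            # Hypothetical instructions: keep JSON out of the spoken reply.
            "instructions": (
                "Speak only your reply to the user and never read JSON aloud. "
                "After each reply, call report_conversation_status."
            ),
            "tools": [
                {
                    "type": "function",
                    "name": "report_conversation_status",  # hypothetical name
                    "description": "Report whether the conversation is finished.",
                    "parameters": {
                        "type": "object",
                        "properties": {"complete": {"type": "boolean"}},
                        "required": ["complete"],
                    },
                }
            ],
            "tool_choice": "auto",
        },
    }


def extract_complete_flag(event: dict) -> Optional[bool]:
    """Return the complete flag from a function-call event, else None."""
    if event.get("type") != "response.function_call_arguments.done":
        return None
    # The arguments are assumed to arrive as a JSON-encoded string.
    args = json.loads(event["arguments"])
    return bool(args.get("complete", False))


# The payload would be sent as JSON text over the WebSocket connection.
payload = json.dumps(build_session_update())
```

Under these assumptions the audio and its transcript carry only the spoken reply, and the application reads `complete` from the function-call event. There is no guarantee the model calls the function on every turn, so the application would still need a fallback for turns where no call arrives.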