Realtime API Audio Modality output

karanjain · October 16, 2024, 11:43pm

Hello, I’m using audio & text modality for realtime api. Now my original instructions used to return a JSON output which contained a field ‘text’ which needed to be converted to audio but also other fields.

But now with realtime API, i’m trying to get the audio_transcript to still return the JSON format output but the audio to play only the ‘text’ field value from the JSON. It works sometimes and doesn’t work at times even after I specify this in instructions.

Any pointers on how to achieve this?

ivan-luchkin-u · November 9, 2024, 6:48pm

As far as I understand, you are trying to do realtime stt and tts, which is not something that this model is made for. It can take information either as text or audio as input, but it can only respond to it, not transcribe it. Same with tts but in reverse, so you can’t just tell it to convert some text into audio

karanjain · November 9, 2024, 7:04pm

Let me clarify - I’m building a conversational agent where I’m passing in a text prompt which expects to do some evaluation and return a json with 2 fields - reply (text field) which is the agent reply to the prompt and complete (bool field) which indicates whether the conversation is completed or not.

Now while using text to audio realtime API, I’m trying to understand how to accomplish this since audio transcript contains a JSON with 2 fields but I want the audio to only contact audio content for the text field.

ivan-luchkin-u · November 9, 2024, 7:34pm

You can’t get the model to turn your text reply into audio because it can only “voice out” it’s own response. This is not a realtime text-to-speech API

karanjain · November 9, 2024, 7:56pm

I get why it doesn’t work out of the box, and that’s why I started this thread to look for any workarounds people may have to handle this whilst getting realtime API benefits. Since a lot of people who were using completions API with json response will likely move to realtime and may face similar issue

ivan-luchkin-u · November 9, 2024, 8:11pm

I really don’t know that to suggest then. I obviously don’t fully understand the case you are trying to solve with this, but I wonder why do you need the model to specifically turn the response text into audio? You also mentioned function calling if I understood correctly, but its intended purpose for Realtime API is definitely not to somehow workaround the model architecture and make it possible to turn text to audio

valdegg · December 13, 2024, 12:33pm

You can start a background process which has gpt-4o read in the transcript and analyse it into a json. GPT4o has structured output, which realtime api does not have.

ivan-luchkin-u · December 13, 2024, 5:29pm

Realtime API does have function calling so your response is a bit misleading. I doubt this will help him though

Topic		Replies	Views
How to get text only output from the Realtime API? API api , realtime	14	3699	June 20, 2025
Realtime API message response - Audio + Text API realtime	2	914	October 17, 2024
How can I pass a system prompt and audio user input to get a text output back? API	15	1372	November 3, 2024
Multimodal/realtime API - audio to text output, not transccription API api , multimodal	2	124	April 20, 2025
Is there any way to use realtime audio API and we can set a bahvioural prompt and configure the output to JSON? API api , prompt , assistants-api , gpt , gpt-4o-audio-preview	0	38	June 20, 2025

Realtime API Audio Modality output

Related topics