Quick question: I know the realtime models support text and audio as modalities. Is it possible to give the model audio as input and, as output, get both the audio and the corresponding text transcript (without requiring an additional STT step)? Or is only one modality available for output?
You can specify a transcription model that runs in parallel to the realtime audio model. When you do, you will get audio responses as usual, plus text transcript content.
Handle these message types:

- `response.output_audio.delta` → audio chunks from the model for your user to hear
- `conversation.item.input_audio_transcription.completed` → text transcription of the audio your user has spoken to the model
- `response.output_audio_transcript.done` → text transcripts of audio the model has spoken to your user.
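A minimal sketch of dispatching those three event types. This assumes the usual payload fields (`delta` carrying a base64 audio chunk, `transcript` carrying text); the `out` dict is a stand-in for whatever playback and UI calls your app actually makes.

```python
import base64
import json


def handle_event(raw: str, out: dict) -> None:
    """Dispatch one realtime server event by type (event names from above).

    `out` just collects results here; replace the branches with real
    audio-playback / UI calls in your app.
    """
    event = json.loads(raw)
    etype = event.get("type")

    if etype == "response.output_audio.delta":
        # base64-encoded audio chunk for the user to hear
        out["audio"] = out.get("audio", b"") + base64.b64decode(event["delta"])
    elif etype == "conversation.item.input_audio_transcription.completed":
        # text transcript of what the user said
        out["user_transcript"] = event["transcript"]
    elif etype == "response.output_audio_transcript.done":
        # text transcript of what the model said
        out["model_transcript"] = event["transcript"]
```

In a real client this function would be the `on_message` callback of your websocket connection.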
Set up the session like this (important keys marked with `#<<<<` so you can ignore keys that are specific to my use case):
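A minimal sketch of such a `session.update` payload, built as a Python dict and serialized to JSON before sending over the websocket. The key names assume the GA Realtime API session shape, and the transcription model and voice below are illustrative choices, not requirements.

```python
import json

# Sketch of a session.update that enables input-audio transcription
# alongside normal audio output. Model and voice names are illustrative.
session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "output_modalities": ["audio"],  #<<<< audio out; transcript events still arrive as text
        "audio": {
            "input": {
                "transcription": {  #<<<< enables conversation.item.input_audio_transcription.* events
                    "model": "gpt-4o-mini-transcribe",
                },
                "turn_detection": {"type": "server_vad"},
            },
            "output": {"voice": "marin"},
        },
    },
}

# Sent as a JSON text frame over the realtime websocket:
payload = json.dumps(session_update)
```

Once the `transcription` block is present, the input-transcription events listed above start arriving alongside the audio deltas.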
Thank you very much @mcfinley. That's the approach I am currently taking, with a second model running in parallel. I was just wondering whether what I am doing is inefficient and whether the realtime model could output two modalities at the same time.