Realtime API - No response audio or audio deltas, despite modalities being set to ['audio', 'text']

alert-it · October 24, 2024, 3:24pm

Hello,
I am having an issue with the Realtime API where I create a session, then send a “session.update” event like so:

    await websocket.send_json({
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "instructions": "Respond to the user with text, audio, and transcriptions.",
            "input_audio_transcription": {
                "model": "whisper-1"
            },
            "turn_detection": None,
            "max_response_output_tokens": 4096,
        }
    })

The server responds with the following, which confirms that my changes took effect:

{
    'type': 'session.updated',
    'event_id': 'event_ALtpk1rKmUrqEL72KFv5j',
    'session': {
        'id': 'sess_ALtpjQsv5vwQ7MvebMUDZ',
        'object': 'realtime.session',
        'model': 'gpt-4o-realtime-preview-2024-10-01',
        'expires_at': 1729783775,
        'modalities': ['audio', 'text'],
        'instructions': 'Respond to the user with text, audio, and transcriptions.',
        'voice': 'alloy',
        'turn_detection': None,
        'input_audio_format': 'pcm16',
        'output_audio_format': 'pcm16',
        'input_audio_transcription': {
            'model': 'whisper-1'
        },
        'tool_choice': 'auto',
        'temperature': 0.8,
        'max_response_output_tokens': 4096,
        'tools': []
    }
}
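Rather than comparing the echoed session by eye, the check can be automated. This is only a minimal sketch: `unapplied_settings` is a hypothetical helper (not part of any OpenAI library) that compares the `session` object sent in `session.update` against the one returned in `session.updated`, using shortened copies of the payloads shown above.

```python
from typing import Any

_MISSING = object()  # sentinel so a requested value of None is still compared


def unapplied_settings(requested: dict[str, Any], event: dict[str, Any]) -> dict[str, Any]:
    """Return the requested settings that the session.updated event does not echo back."""
    if event.get("type") != "session.updated":
        raise ValueError(f"expected session.updated, got {event.get('type')!r}")
    session = event.get("session", {})
    return {
        key: value
        for key, value in requested.items()
        if session.get(key, _MISSING) != value
    }


# Shortened versions of the payloads from this post.
requested = {
    "modalities": ["audio", "text"],
    "turn_detection": None,
    "max_response_output_tokens": 4096,
}
event = {
    "type": "session.updated",
    "session": {
        "modalities": ["audio", "text"],
        "voice": "alloy",
        "turn_detection": None,
        "max_response_output_tokens": 4096,
    },
}

print(unapplied_settings(requested, event))  # {}
```

An empty result only shows that the server accepted the settings. As the rest of this thread shows, it does not guarantee that audio is actually returned.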

I send a system message, followed by a response.create event like so:

    await websocket.send_json({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "system",
            "content": [
                {
                    "type": "input_text",
                    "text": "Greet the user with text and audio."
                }
            ]
        }
    })
    await asyncio.sleep(0.2)
    await websocket.send_json({
        "type": "response.create",
        "response": {
            "modalities": ["audio", "text"],
            "instructions": "Greet the User with a friendly, helpful tone. use Text and Audio."
        }
    })

I receive only text back, with no audio at all, even though I have specified both audio and text. Subsequent user inputs that follow a similar flow also return only text. I am sending audio up and getting text back, so I know my audio is being transcribed correctly, but no audio ever comes back from the server. Is anyone else experiencing this issue?
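To confirm that no audio is arriving (rather than arriving and being dropped by the client), it can help to tally the server events by type. The sketch below is only illustrative: `summarize_events` is a hypothetical helper, and it assumes that audio chunks arrive as `response.audio.delta` events whose `delta` field is base64-encoded audio, which is my understanding of the beta event format and is not stated in this post.

```python
import base64
from collections import Counter
from typing import Any, Iterable


def summarize_events(events: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Count server events by type and total the decoded audio bytes received."""
    counts: Counter = Counter()
    audio_bytes = 0
    for event in events:
        event_type = event.get("type", "<missing type>")
        counts[event_type] += 1
        # Assumption: audio arrives as base64 text in the "delta" field.
        if event_type == "response.audio.delta":
            audio_bytes += len(base64.b64decode(event.get("delta", "")))
    return {
        "counts": dict(counts),
        "audio_bytes": audio_bytes,
        "got_audio": audio_bytes > 0,
    }


# Made-up sample events, just to show the shape of the summary.
sample = [
    {"type": "response.text.delta", "delta": "Hi"},
    {"type": "response.audio.delta",
     "delta": base64.b64encode(b"\x00\x01\x02\x03").decode("ascii")},
    {"type": "response.done"},
]
summary = summarize_events(sample)
print(summary["got_audio"], summary["audio_bytes"])
```

In the situation described here, the summary would show text events but `got_audio` would be `False`.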

alert-it · October 24, 2024, 5:18pm

For anyone else who has this issue, I made three changes and then it started working:

1. Changed my session instructions to match the GitHub example (openai/openai-realtime-api-beta, the Node.js + JavaScript reference client for the Realtime API beta).
2. Split my session update into two separate client events (this can also be seen in the GitHub example).
3. Changed my initial system message to a user message that simply says “Hello” (also from the GitHub example).

It seems odd that it suddenly works after these changes, and I can’t verify with certainty which of them, if any, actually solved the problem, but I’ll take it.
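For reference, here is a rough sketch of what the three changes listed above might look like as a sequence of client events. Several details are assumptions and not taken from this post: the instruction text is a placeholder, the way the settings are divided between the two `session.update` events is a guess, and `build_startup_events` and `send_startup_events` are hypothetical helper names. The `websocket` argument is any object with an async `send_json` method, as in the code earlier in the thread.

```python
import asyncio
from typing import Any


def build_startup_events() -> list[dict[str, Any]]:
    """Build the client events in the order they would be sent."""
    return [
        # First session update: instructions only (placeholder text).
        {
            "type": "session.update",
            "session": {"instructions": "You are a friendly, helpful assistant."},
        },
        # Second session update: remaining settings (this split is an assumption).
        {
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": None,
            },
        },
        # A user message that simply says "Hello", replacing the system message.
        {
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Hello"}],
            },
        },
        {"type": "response.create", "response": {"modalities": ["audio", "text"]}},
    ]


async def send_startup_events(websocket: Any, pause: float = 0.2) -> None:
    """Send each event in order, pausing briefly between them."""
    for event in build_startup_events():
        await websocket.send_json(event)
        await asyncio.sleep(pause)
```

Since it is unclear which change mattered, applying them one at a time would be the way to narrow it down.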
