Realtime API - No response audio or audio deltas, despite modalities being set to ['audio', 'text']

Hello,
I am having an issue with the Realtime API where I create a session, then send a “session.update” event like so:

                    await websocket.send_json({
                        "type": "session.update",
                        "session": {
                            "modalities": ["audio", "text"],
                            "instructions": "Respond to the user with text, audio, and transcriptions.",
                            "input_audio_transcription": {
                                "model": "whisper-1"
                            },
                            "turn_detection": None,
                            "max_response_output_tokens": 4096,
                        }
                    })

I get a response from the server like this, which confirms that my changes took effect:

{
    'type': 'session.updated',
    'event_id': 'event_ALtpk1rKmUrqEL72KFv5j',
    'session': {
        'id': 'sess_ALtpjQsv5vwQ7MvebMUDZ',
        'object': 'realtime.session',
        'model': 'gpt-4o-realtime-preview-2024-10-01',
        'expires_at': 1729783775,
        'modalities': ['audio', 'text'],
        'instructions': 'Respond to the user with text, audio, and transcriptions.',
        'voice': 'alloy',
        'turn_detection': None,
        'input_audio_format': 'pcm16',
        'output_audio_format': 'pcm16',
        'input_audio_transcription': {
            'model': 'whisper-1'
        },
        'tool_choice': 'auto',
        'temperature': 0.8,
        'max_response_output_tokens': 4096,
        'tools': []
    }
}
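
A minimal sketch of how that confirmation could be automated, rather than read off by eye. The helper name `unapplied_settings` is hypothetical and not part of any OpenAI library; it only compares the `session` object sent in `session.update` with the one echoed back in `session.updated`.

```python
def unapplied_settings(requested, applied):
    """Return {key: (wanted, got)} for every requested setting the server did not echo back."""
    mismatches = {}
    for key, wanted in requested.items():
        # A key that is missing from the server's session counts as a mismatch.
        if key not in applied or applied[key] != wanted:
            mismatches[key] = (wanted, applied.get(key))
    return mismatches
```

With the two payloads shown above, `unapplied_settings(sent["session"], received["session"])` would be expected to return an empty dict, since every requested field appears unchanged in the server's reply.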

I send a system message, followed by a response.create event like so:

                    await websocket.send_json({
                        "type": "conversation.item.create",
                        "item": {
                            "type": "message",
                            "role": "system",
                            "content": [
                                {
                                    "type": "input_text",
                                    "text": "Greet the user with text and audio."
                                }
                            ]
                        }
                    })
                    await asyncio.sleep(0.2)
                    await websocket.send_json({
                        "type": "response.create",
                        "response": {
                            "modalities": ["audio", "text"],
                            "instructions": "Greet the User with a friendly, helpful tone. use Text and Audio."
                        }
                    })

But I receive no audio back, only text, even though I clearly specified both audio and text. Subsequent user inputs that follow a similar flow also return only text. I am sending audio up and getting text back, so I know my audio is being transcribed correctly, but the server never sends any audio. Is anyone else experiencing this issue?
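
One way to make the symptom concrete is to tally the server events received during a response and count the audio bytes they carry. This is a rough sketch under assumptions: the event name `response.audio.delta` and its base64-encoded `delta` field are taken from my reading of the Realtime API beta documentation, not from the output above, and `summarize_events` is a hypothetical helper.

```python
import base64
from collections import Counter


def summarize_events(events):
    """Count server events by type and total the decoded audio bytes."""
    counts = Counter(event.get("type", "unknown") for event in events)
    audio_bytes = 0
    for event in events:
        # Assumed event name; audio chunks are assumed to arrive base64-encoded in "delta".
        if event.get("type") == "response.audio.delta":
            audio_bytes += len(base64.b64decode(event["delta"]))
    return counts, audio_bytes
```

If the returned byte count is zero while text-related event types appear in the counter, the server really is sending no audio, as opposed to the client dropping or mishandling it.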


For anyone else who has this issue, I did a few things and then it started working:

  1. Changed my session instructions to match the GitHub example (openai/openai-realtime-api-beta, the Node.js + JavaScript reference client for the Realtime API beta)
  2. Split my session update into two separate client events (this can also be seen in the GitHub example)
  3. Changed my initial system message to a user message that simply says “Hello” (also from the GitHub example)
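
The three changes above can be sketched roughly as follows. This is an illustration under assumptions, not the exact code that fixed it: the helper names are hypothetical, the way the settings are divided between the two `session.update` events is a guess, and the instructions text is left as a parameter because the wording from the GitHub example is not reproduced here.

```python
def build_session_updates(instructions):
    """Build two separate session.update events instead of one combined event."""
    first = {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "instructions": instructions,
        },
    }
    # Assumed split: the remaining settings from the original update go in a second event.
    second = {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {"model": "whisper-1"},
            "turn_detection": None,
            "max_response_output_tokens": 4096,
        },
    }
    return [first, second]


def build_greeting_events():
    """Build a user "Hello" message followed by a response.create event."""
    user_message = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",  # previously "system"
            "content": [{"type": "input_text", "text": "Hello"}],
        },
    }
    create_response = {
        "type": "response.create",
        "response": {"modalities": ["audio", "text"]},
    }
    return [user_message, create_response]
```

Each event would then be sent in order with the same `await websocket.send_json(event)` call used in the original post.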

It’s a bit odd that it suddenly works after these changes, and I can’t say with certainty which of them, if any, actually solved the problem, but I’ll take it.
