Hello,
I am having an issue with the Realtime API where I create a session, then send a “session.update” event like so:
await websocket.send_json({
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],
        "instructions": "Respond to the user with text, audio, and transcriptions.",
        "input_audio_transcription": {
            "model": "whisper-1"
        },
        "turn_detection": None,
        "max_response_output_tokens": 4096,
    }
})
The server responds with the following, which confirms that my changes took effect:
{
    'type': 'session.updated',
    'event_id': 'event_ALtpk1rKmUrqEL72KFv5j',
    'session': {
        'id': 'sess_ALtpjQsv5vwQ7MvebMUDZ',
        'object': 'realtime.session',
        'model': 'gpt-4o-realtime-preview-2024-10-01',
        'expires_at': 1729783775,
        'modalities': ['audio', 'text'],
        'instructions': 'Respond to the user with text, audio, and transcriptions.',
        'voice': 'alloy',
        'turn_detection': None,
        'input_audio_format': 'pcm16',
        'output_audio_format': 'pcm16',
        'input_audio_transcription': {
            'model': 'whisper-1'
        },
        'tool_choice': 'auto',
        'temperature': 0.8,
        'max_response_output_tokens': 4096,
        'tools': []
    }
}
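For anyone who wants to check this programmatically rather than by eye, here is a minimal sketch that compares the fields sent in session.update against the session echoed back in session.updated. The helper name session_mismatches and the _MISSING marker are my own (hypothetical), and the echoed dictionary below is just a subset of the response shown above.

```python
# Hypothetical helper: not part of any SDK, just a plain dict comparison.
_MISSING = "<missing>"


def session_mismatches(requested: dict, echoed: dict) -> dict:
    """Return {field: (requested_value, echoed_value)} for fields that differ."""
    mismatches = {}
    for key, value in requested.items():
        # Treat an absent key as a mismatch, even if the requested value is None.
        if key not in echoed or echoed[key] != value:
            mismatches[key] = (value, echoed.get(key, _MISSING))
    return mismatches


# The "session" object sent in session.update.
requested = {
    "modalities": ["audio", "text"],
    "instructions": "Respond to the user with text, audio, and transcriptions.",
    "input_audio_transcription": {"model": "whisper-1"},
    "turn_detection": None,
    "max_response_output_tokens": 4096,
}

# Subset of the "session" object returned in session.updated.
echoed = {
    "modalities": ["audio", "text"],
    "instructions": "Respond to the user with text, audio, and transcriptions.",
    "voice": "alloy",
    "turn_detection": None,
    "input_audio_transcription": {"model": "whisper-1"},
    "max_response_output_tokens": 4096,
}

# An empty dict means every requested field was echoed back unchanged.
print(session_mismatches(requested, echoed))
```

With the values from this post the result is empty, so the session configuration itself does not appear to be the problem.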
I then send a system message, followed by a response.create event, like so:
await websocket.send_json({
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "system",
        "content": [
            {
                "type": "input_text",
                "text": "Greet the user with text and audio."
            }
        ]
    }
})
await asyncio.sleep(0.2)
await websocket.send_json({
    'type': 'response.create',
    'response': {
        'modalities': ['audio', 'text'],
        'instructions': 'Greet the User with a friendly, helpful tone. use Text and Audio.'
    }
})
However, I receive only text back, never audio, even though I’ve specified both audio and text in the session and in the response.create event. The same happens on subsequent user inputs that follow a similar flow: only text comes back. I am sending audio up, and the server returns text, so I know my audio is being transcribed correctly, but I get no audio back from the server. Is anyone else experiencing this issue?
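To make "only text, no audio" concrete, here is a rough sketch of how the incoming events could be tallied by type. The event names (response.audio.delta, response.text.delta, response.audio_transcript.delta and their .done counterparts) are what I understand the Realtime API beta to emit, so treat them as assumptions. summarize_events is a hypothetical helper, and the sample list is made-up illustration data, not a real server log.

```python
from collections import Counter

# Assumed event type names; adjust if the API uses different ones.
AUDIO_EVENTS = {"response.audio.delta", "response.audio.done"}
TEXT_EVENTS = {"response.text.delta", "response.text.done"}
TRANSCRIPT_EVENTS = {
    "response.audio_transcript.delta",
    "response.audio_transcript.done",
}


def summarize_events(events: list) -> dict:
    """Count received server events, grouped into audio, text, and transcript."""
    counts = Counter(event.get("type", "<no type>") for event in events)
    return {
        "audio": sum(counts[t] for t in AUDIO_EVENTS),
        "text": sum(counts[t] for t in TEXT_EVENTS),
        "audio_transcript": sum(counts[t] for t in TRANSCRIPT_EVENTS),
        "by_type": dict(counts),
    }


# Made-up sample of already-decoded events, shaped like the text-only case.
sample = [
    {"type": "response.created"},
    {"type": "response.text.delta", "delta": "Hello"},
    {"type": "response.text.delta", "delta": " there"},
    {"type": "response.text.done"},
    {"type": "response.done"},
]

summary = summarize_events(sample)
print(summary["audio"], summary["text"], summary["audio_transcript"])
```

Appending each decoded message from the websocket to a list and passing it to summarize_events after response.done would show whether any audio events arrive at all, or whether the response really is text-only.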