Hello there, ChatGPT Forum,
I noticed some problems when trying to update the modalities in the Realtime API.
As far as I can tell, you can set the modalities in either the session.update event or the response.create event:
{
  "event_id": "event_123",
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"], <---
    "instructions": "Your knowledge cutoff is 2023-10. You are a helpful assistant.",
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    },
    "tools": [
      {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a location, tell the user you are fetching the weather.",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string" }
          },
          "required": ["location"]
        }
      }
    ],
    "tool_choice": "auto",
    "temperature": 0.8,
    "max_response_output_tokens": "inf"
  }
}
{
  "event_id": "event_234",
  "type": "response.create",
  "response": {
    "modalities": ["text", "audio"], <--
    "instructions": "Please assist the user.",
    "voice": "alloy",
    "output_audio_format": "pcm16",
    "tools": [
      {
        "type": "function",
        "name": "calculate_sum",
        "description": "Calculates the sum of two numbers.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": { "type": "number" },
            "b": { "type": "number" }
          },
          "required": ["a", "b"]
        }
      }
    ],
    "tool_choice": "auto",
    "temperature": 0.7,
    "max_output_tokens": 150
  }
}
Switching from ["text", "audio"] to only ["text"] partially works, although it then also seems to output the function args in the text deltas. However, switching back from ["text"] to ["text", "audio"] does not work: all subsequent responses are still text only. I can verify from the server-side event that the session has indeed been updated with the correct modalities, but I still only get a text response.
I assume this is a bug? Has anyone else encountered it and been able to fix it?
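For anyone who wants to reproduce this, here is a minimal sketch of the event sequence I am describing. The helper names (`build_session_update`, `build_response_create`, `repro_sequence`) are hypothetical and only build the JSON payloads; sending them over the WebSocket is left out. I am also assuming that a session.update carrying only the "modalities" field is accepted, so the other fields from the payloads above are omitted.

```python
import json

# The two modalities used in the payloads above.
VALID_MODALITIES = {"text", "audio"}


def _check(modalities):
    """Reject anything other than "text" / "audio"."""
    unknown = set(modalities) - VALID_MODALITIES
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    return list(modalities)


def build_session_update(modalities):
    """Hypothetical helper: a session.update that only changes the modalities.

    Assumption: the server accepts a partial session object.
    """
    return {"type": "session.update", "session": {"modalities": _check(modalities)}}


def build_response_create(modalities):
    """Hypothetical helper: a response.create with explicit modalities."""
    return {"type": "response.create", "response": {"modalities": _check(modalities)}}


def repro_sequence():
    """Events in the order that triggers the behaviour described above."""
    steps = (["text", "audio"], ["text"], ["text", "audio"])
    events = []
    for modalities in steps:
        events.append(build_session_update(modalities))
        events.append(build_response_create(modalities))
    return events


if __name__ == "__main__":
    # Print each event as the JSON string that would be sent to the server.
    for event in repro_sequence():
        print(json.dumps(event))
```

In my case the last two events in this sequence are the ones that still produce a text-only response, even though the modalities are set both on the session and on the response itself.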