Realtime API: Updating Modalities

Hello there ChatGPT Forum

I noticed some problems when trying to update the modalities in the Realtime API:

As far as I can tell, you can set the modalities in either the session.update event or the response.create event:

{
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],  <---
        "instructions": "Your knowledge cutoff is 2023-10. You are a helpful assistant.",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500
        },
        "tools": [
            {
                "type": "function",
                "name": "get_weather",
                "description": "Get the current weather for a location, tell the user you are fetching the weather.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": { "type": "string" }
                    },
                    "required": ["location"]
                }
            }
        ],
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf"
    }
}


{
    "event_id": "event_234",
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],  <--
        "instructions": "Please assist the user.",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tools": [
            {
                "type": "function",
                "name": "calculate_sum",
                "description": "Calculates the sum of two numbers.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "a": { "type": "number" },
                        "b": { "type": "number" }
                    },
                    "required": ["a", "b"]
                }
            }
        ],
        "tool_choice": "auto",
        "temperature": 0.7,
        "max_output_tokens": 150
    }
}

Switching from ["text", "audio"] to only ["text"] partially works, although it seems to also output the function args in the text deltas. However, switching back from ["text"] to ["text", "audio"] does not work: all subsequent responses are still text only. I can verify in the server-side event that the session has indeed been updated with the correct modalities, but I still only get a text response.
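For anyone who wants to reproduce this, here is a minimal sketch of the event sequence I mean. It only builds the JSON payloads, so sending them over your WebSocket connection is up to you. The helper names (`session_update`, `response_create`) are my own, and I am assuming that a session.update carrying only the `modalities` field is accepted as a partial update.

```python
import json


def session_update(modalities):
    """Build a session.update event that changes only the modalities.

    Assumption: the other session fields may be omitted in an update.
    """
    return {"type": "session.update", "session": {"modalities": list(modalities)}}


def response_create(modalities):
    """Build a response.create event that requests the given modalities."""
    return {"type": "response.create", "response": {"modalities": list(modalities)}}


# Reproduction sequence: go text-only, then try to switch back to audio.
events = [
    session_update(["text"]),
    response_create(["text"]),
    session_update(["text", "audio"]),
    response_create(["text", "audio"]),  # in my tests this still comes back text only
]

# Serialize each event the way it would be sent over the socket.
payloads = [json.dumps(event) for event in events]
for payload in payloads:
    print(payload)
```

Setting the modalities on both the session and the individual response is deliberate here, to rule out one of the two being ignored.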

I assume this is a bug? Has anyone else encountered it and been able to fix it?

I am also getting this issue, so you are not alone.

If the session starts off with a few [audio, text] responses, it usually sticks to audio fairly well. But since the only way to add conversation history is with text messages, adding any history at all gets it stuck always outputting text. Sometimes you can kind of trick it by telling it you changed its response parameters to output audio, but that’s very unreliable (and might just be a spurious pattern rather than anything real).

If you request text only, it seems to love making up fake JSON responses. For this I’ve had decent luck injecting a few fake messages with the assistant role into the chat history that contain regular speech; after that it usually doesn’t do the JSON thing anymore (though again, pretty unreliable).

Have you had any luck finding a workaround?

EDIT: This works roughly consistently for me. Before injecting the conversation history, first create a conversation item that says something like “Initialize audio” (it could be anything really, but this is least likely to pollute the conversation context), then send a response.create asking for [text, audio], and only then inject all your conversation history. As long as it has a single example of having responded in audio, it won’t refuse to do audio later.
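Roughly, the ordering from my edit looks like the sketch below. It only builds the events in the order I send them. The helper names (`message_item`, `priming_events`) and the sample history are made up for illustration, and I am assuming the conversation.item.create shape where user text uses the content type `input_text` and assistant text uses `text`.

```python
import json


def message_item(role, text):
    """Build a conversation.item.create event for one text message.

    Assumption: user text uses content type "input_text",
    assistant text uses "text".
    """
    content_type = "text" if role == "assistant" else "input_text"
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": role,
            "content": [{"type": content_type, "text": text}],
        },
    }


def priming_events(history, primer_text="Initialize audio"):
    """Return events in workaround order: primer, audio response, then history."""
    events = [
        # 1. A throwaway user message that should not pollute the context.
        message_item("user", primer_text),
        # 2. Ask for one response with audio so the session has an audio example.
        {"type": "response.create", "response": {"modalities": ["text", "audio"]}},
    ]
    # 3. Only now inject the real (text-only) conversation history.
    events += [message_item(role, text) for role, text in history]
    return events


# Hypothetical history, just to show the shape of the input.
history = [
    ("user", "What's the weather like in Berlin?"),
    ("assistant", "Let me check that for you."),
]

events = priming_events(history)
for event in events:
    print(json.dumps(event))
```

In practice I wait for the primed response to finish before sending the history items, so the audio reply is already part of the conversation when the text messages arrive.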

Okay, good to know it’s not just me. 🙃

I honestly haven’t played around with it too much. I figured it is probably a bug in the API and focused on completing my other features, so I have not found a solution yet. I’ll give your approach a go!