Realtime API: Updating Modalities

Hello there ChatGPT Forum

I noticed some problems when trying to update the modalities in the Realtime API:

As far as I can tell, you can set the modalities either in the session.update event or in the response.create event:

{
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],  <---
        "instructions": "Your knowledge cutoff is 2023-10. You are a helpful assistant.",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500
        },
        "tools": [
            {
                "type": "function",
                "name": "get_weather",
                "description": "Get the current weather for a location, tell the user you are fetching the weather.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": { "type": "string" }
                    },
                    "required": ["location"]
                }
            }
        ],
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf"
    }
}


{
    "event_id": "event_234",
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],  <--
        "instructions": "Please assist the user.",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tools": [
            {
                "type": "function",
                "name": "calculate_sum",
                "description": "Calculates the sum of two numbers.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "a": { "type": "number" },
                        "b": { "type": "number" }
                    },
                    "required": ["a", "b"]
                }
            }
        ],
        "tool_choice": "auto",
        "temperature": 0.7,
        "max_output_tokens": 150
    }
}

Switching from [“text”, “audio”] to only [“text”] partially works, though it then also seems to output the function args in the text deltas. However, switching back from [“text”] to [“text”, “audio”] does not work: all subsequent responses are still text only. I can verify in the server-side event that the session has indeed been updated with the correct modalities, but the responses remain text only.
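For context, the switch itself is just a session.update with a different modalities list. This is roughly what I'm sending (a sketch; ws is my already-open WebSocket to the Realtime API, e.g. from the websockets library, and set_modalities is just my own helper name):

import json

async def set_modalities(ws, text_only: bool):
    # Toggle the session between text-only and text + audio output.
    modalities = ["text"] if text_only else ["text", "audio"]
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"modalities": modalities},
    }))

# Switching to text only works (aside from the function args in the deltas):
# await set_modalities(ws, text_only=True)
# Switching back does not take effect, even though the server confirms the update:
# await set_modalities(ws, text_only=False)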

I assume this is a bug? Has anyone else encountered it and been able to fix it?

4 Likes

I am also getting this issue, so you are not alone.

If it starts off immediately with a few [audio, text] responses, it usually sticks to audio fairly well. But since the only way to add conversation history is with text messages, adding any history at all gets it stuck always outputting text. Sometimes you can kind of trick it by telling it you changed its response parameters to output audio, but that’s very unreliable (and might also just be a spurious pattern rather than anything real).

If you request text only, it seems to love making up fake JSON responses. For that, I’ve had decent luck injecting a few fake assistant-role messages with regular speech into the chat history; after that it usually doesn’t do the JSON thing anymore (though again, it’s pretty unreliable).

Have you had any luck finding a workaround?

EDIT: This works roughly consistently for me. Before injecting the conversation history, first create a conversation item that says something like “Initialize audio” (could be anything really, but this is least likely to pollute the conversation context), and then do a response.create asking for [text, audio], then inject all your conversation history after that. As long as it has a single example of having responded in audio, it won’t refuse to do audio later.
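In event terms, the sequence I mean looks roughly like this (just a sketch; ws is assumed to be an open Realtime API WebSocket, e.g. from the websockets library, and history a list of dicts with "role" and "text" keys):

import json

async def prime_audio_then_inject_history(ws, history):
    # 1. A throwaway user item so the model has something to answer in audio.
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Initialize audio"}],
        },
    }))
    # 2. Ask for an audio response, so the conversation contains at least one
    #    example of the assistant having answered in audio.
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {"modalities": ["text", "audio"]},
    }))
    # 3. Only now inject the real conversation history as text items.
    for msg in history:
        content_type = "input_text" if msg["role"] == "user" else "text"
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": msg["role"],  # "user" or "assistant"
                "content": [{"type": content_type, "text": msg["text"]}],
            },
        }))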

3 Likes

Okay, good to know it’s not just me. :upside_down_face:

I honestly haven’t played around with it much, since I figured it is probably a bug in the API, and I focused on completing my other features instead, so I haven’t found a solution yet. I’ll give your approach a go!

1 Like

I’m also experiencing this issue: sending any chat history before the first response causes the AI to respond only in text. I worked around it by including the chat history in the instructions, but it’s a bit hacky and gives the chat history undue importance, since the instructions apply to all conversation items.

@levavakian’s workaround roughly works, but it’s not 100% reliable, so we have to detect when it starts outputting text, reset the socket, and try again.
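Our detection is basically just watching the server events for the response, roughly like this (a sketch; it assumes ws yields JSON-encoded server events, as with the websockets library, and that the caller resets the connection and retries whenever this returns False):

import json

async def response_contains_audio(ws) -> bool:
    # Read server events until the response is done; report whether any
    # audio deltas arrived. If not, we reset the socket and try again.
    got_audio = False
    async for raw in ws:
        event = json.loads(raw)
        if event.get("type") == "response.audio.delta":
            got_audio = True
        elif event.get("type") == "response.done":
            break
    return got_audio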

All in all, it would be great if this got an official/documented solution to force the model to output audio.

3 Likes

We are facing the exact same issue. After creating conversation items, responses randomly come back as either text or audio, when we need audio every time. There should be a way to force audio output.

3 Likes

Oh man, I spent (wasted) quite a bit of time trying to figure this out :wink:

Wondering if someone from OpenAI knows about it.

And btw, it’s also a workaround, but this approach works:

1 Like

Yeah, it is weird. Somehow it does work right after the connection is established (as in the workaround you posted), but not after that, once the session is already up and running. I am hoping they’ll fix it. I basically just wanted to implement a toggle for muting/unmuting the audio at any time during the session, and since I don’t want to pay for unnecessary audio tokens, I wanted to switch to text-only responses and back.

1 Like

I have the same problem: when the first message is text, no voice data is sent afterwards. I didn’t have this problem with the previous version; it started with the new model introduced yesterday.

1 Like

Hmm, interesting. I haven’t been able to switch to their new SDK yet. I had hoped they’d fixed this in the new update. I’ll try it and check back.

1 Like

Hey all,

Alright, I revisited this issue after a while. I had de-prioritized this feature in development, and I’ve now had some time to take another crack at it.

So, @levavakian’s approach with additional prompting does work.

Here is how I have implemented it:

  1. The user can toggle audio playback in the UI.
  2. This updates the modalities in the client-side “response.create” event, i.e.:

response_create_event = {
    'type': 'response.create',
    'response': {
        'modalities': current_modality,  # <--
    }
}

  3. It also updates the modalities in the session and sends this as a new “session.update” event:

session_update_event = {
    'type': 'session.update',
    'session': {
        'modalities': current_modality,  # <--
    }
}

  4. I send a system message (without triggering a new response) informing the model that the modalities have been updated and that it should only respond in those modalities. OpenAI just revamped the whole system message design into a developer message, but I simply send it as a normal user message (role='user') wrapped in the tags <system-message>Message...</system-message> (see the sketch after this list). This fixed the part where it would not correctly switch modalities.
  5. I included a section in the system instructions explaining that it is a multimodal AI with audio and text output and that it can freely switch between the modalities. I also instructed it to only output plain text, without any additional formatting or tags. This fixed the JSON and function args showing up in text responses.
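For reference, the system message from step 4 is just a conversation.item.create with role='user' and the <system-message> tags in the text, sent without a following response.create. Roughly (a sketch; ws is my open Realtime API WebSocket and notify_modality_change is just my own name for the helper):

import json

async def notify_modality_change(ws, current_modality):
    # Tell the model about the new modalities without triggering a response.
    text = (
        '<system-message>The response modalities have been changed to '
        f'{current_modality}. Only respond in these modalities.'
        '</system-message>'
    )
    await ws.send(json.dumps({
        'type': 'conversation.item.create',
        'item': {
            'type': 'message',
            'role': 'user',
            'content': [{'type': 'input_text', 'text': text}],
        },
    }))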

So far, with only a few quick tests, this has worked 100% of the time and seems to have fixed the problem.

But this remains an issue that the OpenAI API Staff needs to look into.

I hope this helps.

Cheers,
Aaron

@levavakian @dnna @Marcin @lss @rvy

2 Likes