Realtime API: Updating Modalities

Hello there ChatGPT Forum

I noticed some problems when trying to update the modalities in the Realtime API:

As far as I can tell, you can set the modalities either in the session.update event or in the response.create event:

{
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],  <---
        "instructions": "Your knowledge cutoff is 2023-10. You are a helpful assistant.",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500
        },
        "tools": [
            {
                "type": "function",
                "name": "get_weather",
                "description": "Get the current weather for a location, tell the user you are fetching the weather.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": { "type": "string" }
                    },
                    "required": ["location"]
                }
            }
        ],
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf"
    }
}


{
    "event_id": "event_234",
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],  <--
        "instructions": "Please assist the user.",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tools": [
            {
                "type": "function",
                "name": "calculate_sum",
                "description": "Calculates the sum of two numbers.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "a": { "type": "number" },
                        "b": { "type": "number" }
                    },
                    "required": ["a", "b"]
                }
            }
        ],
        "tool_choice": "auto",
        "temperature": 0.7,
        "max_output_tokens": 150
    }
}

Switching from [“text”, “audio”] to only [“text”] partially works, though it then also seems to output the function args in the text deltas. However, switching back from [“text”] to [“text”, “audio”] does not work: all subsequent responses are still text only. I can verify in the server-side event that the session has indeed been updated with the correct modalities, but the responses remain text only.
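For context, the switch itself is just a session.update with a different modalities list. This is roughly what I'm sending (a sketch; ws is my already-open WebSocket to the Realtime API, e.g. from the websockets library, and set_modalities is just my own helper name):

import json

async def set_modalities(ws, text_only: bool):
    # Toggle the session between text-only and text + audio output.
    modalities = ["text"] if text_only else ["text", "audio"]
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"modalities": modalities},
    }))

# Switching to text only works (aside from the function args in the deltas):
# await set_modalities(ws, text_only=True)
# Switching back does not take effect, even though the server confirms the update:
# await set_modalities(ws, text_only=False)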

I assume this is a bug? Has anyone else encountered it and been able to fix it?

4 Likes

I am also getting this issue, so you are not alone.

If it starts off immediately with a few [audio, text] responses, it usually sticks to audio fairly well. But since the only way to add conversation history is with text messages, adding any history at all gets it stuck always outputting text. Sometimes you can kind of trick it by telling it you changed its response parameters to output audio, but that’s very unreliable (and might also just be a spurious pattern rather than anything real).

If you request text only, it seems to love making up fake JSON responses. For that, I’ve had decent luck injecting a few fake assistant-role messages with regular speech into the chat history; after that it usually doesn’t do the JSON thing anymore (though again, it’s pretty unreliable).

Have you had any luck finding a workaround?

EDIT: This works roughly consistently for me. Before injecting the conversation history, first create a conversation item that says something like “Initialize audio” (could be anything really, but this is least likely to pollute the conversation context), and then do a response.create asking for [text, audio], then inject all your conversation history after that. As long as it has a single example of having responded in audio, it won’t refuse to do audio later.
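In event terms, the sequence I mean looks roughly like this (just a sketch; ws is assumed to be an open Realtime API WebSocket, e.g. from the websockets library, and history a list of dicts with "role" and "text" keys):

import json

async def prime_audio_then_inject_history(ws, history):
    # 1. A throwaway user item so the model has something to answer in audio.
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Initialize audio"}],
        },
    }))
    # 2. Ask for an audio response, so the conversation contains at least one
    #    example of the assistant having answered in audio.
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {"modalities": ["text", "audio"]},
    }))
    # 3. Only now inject the real conversation history as text items.
    for msg in history:
        content_type = "input_text" if msg["role"] == "user" else "text"
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": msg["role"],  # "user" or "assistant"
                "content": [{"type": content_type, "text": msg["text"]}],
            },
        }))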

3 Likes

Okay, good to know it’s not just me. :upside_down_face:

I honestly haven’t played around with it much, since I figured it is probably a bug in the API, and I focused on completing my other features instead, so I haven’t found a solution yet. I’ll give your approach a go!

1 Like

I’m also experiencing this issue: sending any chat history before the first response causes the AI to respond only in text. I worked around it by including the chat history in the instructions, but it’s a bit hacky and gives the chat history undue importance, since the instructions apply to all conversation items.

@levavakian’s workaround roughly works, but it’s not 100% reliable, so we have to detect when it starts outputting text, reset the socket, and try again.
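Our detection is basically just watching the server events for the response, roughly like this (a sketch; it assumes ws yields JSON-encoded server events, as with the websockets library, and that the caller resets the connection and retries whenever this returns False):

import json

async def response_contains_audio(ws) -> bool:
    # Read server events until the response is done; report whether any
    # audio deltas arrived. If not, we reset the socket and try again.
    got_audio = False
    async for raw in ws:
        event = json.loads(raw)
        if event.get("type") == "response.audio.delta":
            got_audio = True
        elif event.get("type") == "response.done":
            break
    return got_audio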

All in all, it would be great if this got an official/documented solution to force the model to output audio.

3 Likes

We are facing the exact same issue. After creating conversation items, responses randomly come back as either text or audio, when we need audio every time. There should be a way to force audio output.

3 Likes

Oh man, I spent (wasted) quite a bit of time trying to figure this out :wink:

Wondering if someone from OpenAI knows about it.

And btw, it’s also a workaround, but this approach works:

1 Like

Yeah, it is weird. Somehow it does work right after the connection is established (as in the workaround you posted), but not after that, once the session is already up and running. I am hoping they’ll fix it. I basically just wanted to implement a toggle for muting/unmuting the audio at any time during the session, and since I don’t want to pay for unnecessary audio tokens, I wanted to switch to text-only responses and back.

1 Like

I have the same problem: when the first message is text, no voice data is sent afterwards. I didn’t have this problem with the previous version; it started with the new model introduced yesterday.

1 Like

Hmm, interesting. I haven’t been able to switch to their new SDK yet. I had hoped they’d fixed this in the new update. I’ll try it and check back.

1 Like

Hey all,

Alright, I revisited this issue after a while. I had de-prioritized this feature in development, and I’ve now had some time to take another crack at it.

So, @levavakian’s approach with additional prompting does work.

Here is how I have implemented it:

  1. The user can toggle audio playback in the UI.
  2. This updates the modalities in the client-side “response.create” event, i.e.:

response_create_event = {
    'type': 'response.create',
    'response': {
        'modalities': current_modality,  # <--
    }
}

  3. It also updates the modalities in the session and sends this as a new “session.update” event:

session_update_event = {
    'type': 'session.update',
    'session': {
        'modalities': current_modality,  # <--
    }
}

  4. I send a system message (without triggering a new response) informing the model that the modalities have been updated and that it should only respond in those modalities. OpenAI just revamped the whole system message design into a developer message, but I simply send it as a normal user message (role='user') wrapped in the tags <system-message>Message...</system-message> (see the sketch after this list). This fixed the part where it would not correctly switch modalities.
  5. I included a section in the system instructions explaining that it is a multimodal AI with audio and text output and that it can freely switch between the modalities. I also instructed it to only output plain text, without any additional formatting or tags. This fixed the JSON and function args showing up in text responses.
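For reference, the system message from step 4 is just a conversation.item.create with role='user' and the <system-message> tags in the text, sent without a following response.create. Roughly (a sketch; ws is my open Realtime API WebSocket and notify_modality_change is just my own name for the helper):

import json

async def notify_modality_change(ws, current_modality):
    # Tell the model about the new modalities without triggering a response.
    text = (
        '<system-message>The response modalities have been changed to '
        f'{current_modality}. Only respond in these modalities.'
        '</system-message>'
    )
    await ws.send(json.dumps({
        'type': 'conversation.item.create',
        'item': {
            'type': 'message',
            'role': 'user',
            'content': [{'type': 'input_text', 'text': text}],
        },
    }))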

So far, with only a few quick tests, this has worked 100% of the time and seems to have fixed the problem.

But this remains an issue that the OpenAI API Staff needs to look into.

I hope this helps.

Cheers,
Aaron

@levavakian @dnna @Marcin @lss @rvy

2 Likes