Realtime API - Cannot update a conversation's voice if assistant audio is present

Updating the voice in an active Realtime API websocket session works if session.update is sent before any audio has been received from the assistant. However, once some assistant audio has been received, trying to update the session with a new voice produces an error event: “Cannot update a conversation’s voice if assistant audio is present”. Is this by design, or is it a bug?

I would have expected that the voice could be changed, even inside an active session, once response.audio.done has been received for the last response.
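To make the behaviour concrete, here is a rough client-side sketch of it in Python. It only builds and inspects JSON events, so it needs no connection. The class name `VoiceGuard` and its methods are hypothetical, and it assumes that assistant audio arrives as `response.audio.delta` events and that the voice is set through the `voice` field of `session.update`. The guard refuses to build a voice update once assistant audio has been seen, which mirrors the error rather than working around it.

```python
import json
from typing import Optional


class VoiceGuard:
    """Hypothetical helper: remembers whether assistant audio was seen."""

    def __init__(self) -> None:
        self.assistant_audio_seen = False

    def observe(self, raw_event: str) -> None:
        """Inspect one server event received on the websocket."""
        event = json.loads(raw_event)
        # Assumption: assistant audio chunks arrive as "response.audio.delta".
        if event.get("type") == "response.audio.delta":
            self.assistant_audio_seen = True

    def voice_update(self, voice: str) -> Optional[str]:
        """Return a session.update payload, or None if it would be rejected."""
        if self.assistant_audio_seen:
            # The server would answer with the error event quoted above.
            return None
        return json.dumps({"type": "session.update", "session": {"voice": voice}})
```

With something like this, the result of `voice_update("alloy")` would be sent over the socket only when it is not `None`. Note that receiving response.audio.done does not reset the flag, which matches the behaviour I am seeing.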

API reference:

It is by design: a consequence of the internal way a chosen voice is patterned and attached, as a parameter, to the assistant's responses.

Consider that the multimodal model itself was trained on the speech tokens of thousands of speakers. Without extensive post-training, and without the continued activation of the one selected assistant voice, it is poised not just to recognize but also to reproduce any voice, including your input. To switch voices, all of the conversation's spoken output would have to be regenerated by a reliable speech model in the replacement voice, along with whatever hidden methodology is involved. Otherwise the earlier audio would act as in-context training and leave the model uncertain about which voice to use.
