How can I switch from text generation to audio generation?

Hi everyone!

I want to continue a previous text-only conversation with audio in the Realtime API. To do this, I create a session with text + audio modalities and then inject the previous text messages using conversation.item.create events, where each item carries the text (assistant) or input_text (user) content of the original message:

Assistant message:

{"event_id":"...","type":"conversation.item.create","item":{"id":"...","type":"message","status":"completed","role":"assistant","content":[{"type":"text","text":"..."}]}}

User message:

{"event_id":"...","type":"conversation.item.create","item":{"id":"...","type":"message","status":"completed","role":"user","content":[{"type":"input_text","text":"..."}]}}

However, this causes the Realtime API to switch into text-only mode. When I then send user audio (input_audio_buffer.append), voice activity detection etc. works just fine, but after the user finishes speaking, the Realtime API generates text only, no audio.

When I do not inject existing messages at the beginning, this does not happen and the API stays in audio mode as expected.

So, how can I inject a text conversation into the realtime API and continue with an audio conversation?


What I’ve tried so far:

  • Sending a session.update with audio + text modalities after sending the conversation.item.create events - did not fix the issue
  • Also tried sending a session.update with the text-only modality first, then the conversation items, then another update with text + audio - did not fix the issue
  • Tried text instead of input_text for the user messages - rejected by the API
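For reference, the event sequence described above can be sketched like this (the helper names are mine; the event shapes follow the payloads shown in the question, with event_id and item id omitted):

```python
import json

def make_session_update():
    # Request both output modalities. Per this thread, this alone does
    # not reliably keep the model in audio mode once text history has
    # been injected.
    return {
        "type": "session.update",
        "session": {"modalities": ["text", "audio"]},
    }

def make_history_item(role, text):
    # Assistant history uses content type "text"; user history uses
    # "input_text" (plain "text" for the user role is rejected by the
    # API, as noted above).
    content_type = "text" if role == "assistant" else "input_text"
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": role,
            "content": [{"type": content_type, "text": text}],
        },
    }

# These would be serialized and sent over the WebSocket in order:
events = [
    make_session_update(),
    make_history_item("user", "Hello!"),
    make_history_item("assistant", "Hi, how can I help?"),
]
payloads = [json.dumps(e) for e in events]
```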

I had the same problem, and the only mitigation I found is to give all the history of the conversation as if it came with the user but with something like “[Assistant]” and “[User]” in front of the messages. I think combining that with the session.update tricks you mention made it more reliable as well, but that might just be placebo.

As a side note, the other scenario where I bumped into this is trying to reduce API cost by deleting previous audio conversation items and replacing them with their text transcripts (since the cost per input text token is so much lower than the cost per audio token). That triggers the same problem where the model switches to text, but if I keep the last 2 (or ideally 3) assistant messages as audio, it nearly always keeps responding in audio, even though the older history is text.

Emphasis on nearly always, none of this seems 100% reliable, which would be a big problem in production (well, if the costs wouldn’t immediately bankrupt anyone using this in production anyway) - we really need a way to force the model to reply with audio…
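That pruning approach could be sketched as follows; the item-dict shape and KEEP_AUDIO constant are my own, and the conversation.item.delete / conversation.item.create event types are real Realtime API events:

```python
KEEP_AUDIO = 3  # keep the last N assistant audio turns as audio

def prune_events(items):
    # items: local bookkeeping of the conversation, each entry like
    #   {"id": ..., "role": ..., "transcript": ..., "is_audio": bool}
    # Returns delete + create events that swap older audio items for
    # their text transcripts, keeping the most recent assistant audio
    # turns intact (which is what mostly preserves audio replies).
    assistant_audio = [it for it in items
                       if it["is_audio"] and it["role"] == "assistant"]
    keep_ids = {it["id"] for it in assistant_audio[-KEEP_AUDIO:]}
    events = []
    for it in items:
        if it["is_audio"] and it["id"] not in keep_ids:
            events.append({"type": "conversation.item.delete",
                           "item_id": it["id"]})
            content_type = ("text" if it["role"] == "assistant"
                            else "input_text")
            events.append({"type": "conversation.item.create",
                           "item": {"type": "message",
                                    "role": it["role"],
                                    "content": [{"type": content_type,
                                                 "text": it["transcript"]}]}})
    return events
```

Note that re-created items are appended at the end of the conversation by default; to preserve ordering you would likely want to set previous_item_id on each conversation.item.create (check the API reference for the exact semantics).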


Same thing right?


I’m also experiencing this issue. In addition to the things you tried, I also tried changing the system prompt to say “ALWAYS RESPOND WITH AUDIO”; that did not help. I believe this is a bug in the API.

Any idea how to resolve this? We implemented @arund42's suggestion and it got us far, but it’s not reliable enough for anything nearing prod.


I’ve spoken with OpenAI support; however, their response was mostly unhelpful:

When using the real-time API with voice, the mode of communication (text or voice) is determined by the sequence of events and how you initiate the interaction. If you send a series of conversation.item.create events containing text-based user and assistant messages from a previous conversation, the model may assume the context is purely text-based and switch to text mode. Unfortunately, we cannot provide approximate steps for this process.

To prime the connection with a previous text conversation but ensure that the assistant remains in audio mode, you can try the following approach:

  1. Send the conversation history as a summary or metadata: Instead of replaying the entire previous conversation using conversation.item.create, you can send a summary of the relevant context or conversation history in one batch as metadata. This way, the assistant can maintain the conversation context without switching to text mode.
  2. Explicitly re-enable audio mode: After sending the conversation.item.create events, follow them with an event or command to explicitly switch back to audio mode. Depending on the API, this might involve sending a signal that the interaction should resume with audio output, such as:
  • Sending a special event to switch back to audio mode.
  • Sending an audio-based user query or some type of “voice start” event.
  3. Keep the conversation flow in audio: If possible, avoid sending too many conversation.item.create events that include both user and assistant text messages. Instead, send a brief context-setting text and then follow it with an audio-based user input, which should keep the assistant in audio mode.

Check the documentation of the real-time API to see if there’s an explicit mode-switching event that can force the assistant back into voice/audio mode after handling text-based inputs. If available, use that to control the mode explicitly.

I’ll poke them some more to see if I can get anything actually useful.
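One concrete reading of support's first suggestion is to fold the prior conversation into the session instructions instead of creating text items at all; session.update does accept an instructions field. A hedged sketch (the function name and instruction wording are my own):

```python
def session_update_with_summary(summary):
    # Puts the prior text conversation into the system-style
    # instructions rather than into conversation items, so no
    # assistant text messages enter the conversation history.
    return {
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "instructions": (
                "You are continuing an earlier conversation. "
                "Summary of that conversation:\n" + summary
            ),
        },
    }
```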

I’m currently doing session.update as per the post here: Realtime API: Did anybody managed to provide previous conversation transcript history while keeping audio answers? - #6 by hagen.rode

What are others doing?

I had this issue initially.

When sending a payload, make sure to still include BOTH modalities (text AND audio).

This will make the AI respond with audio.
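For concreteness, the two places where modalities can be specified look like this (event_ids omitted); per the advice above, both should list text and audio:

```json
{"type":"session.update","session":{"modalities":["text","audio"]}}
{"type":"response.create","response":{"modalities":["text","audio"]}}
```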

Good luck! :hugs: