How can I switch from text generation to audio generation?

Hi everyone!

I want to continue a previous text-only conversation with audio in the Realtime API. To do this, I create a session with text + audio modalities and then inject the previous text messages using conversation.item.create events, where each item carries the text (assistant) or input_text (user) content of the original message:

Assistant message:

{"event_id":"...","type":"conversation.item.create","item":{"id":"...","type":"message","status":"completed","role":"assistant","content":[{"type":"text","text":"..."}]}}

User message:

{"event_id":"...","type":"conversation.item.create","item":{"id":"...","type":"message","status":"completed","role":"user","content":[{"type":"input_text","text":"..."}]}}

However, this causes the Realtime API to switch into text-only mode. When I then send user audio (input_audio_buffer.append), voice activity detection etc. works just fine, but after the user finishes speaking, the Realtime API generates text only, no audio.

When I do not inject existing messages at the beginning, this does not happen and the API stays in audio mode as expected.

So, how can I inject a text conversation into the realtime API and continue with an audio conversation?


What I’ve tried so far:

  • Sending a session.update with audio + text modalities after sending the conversation.item.create events - did not fix the issue
  • Also tried sending a session.update with the text-only modality first, then the conversation items, then another update with text + audio - did not fix the issue
  • Tried text instead of input_text for the user messages - rejected by the API
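For reference, the event sequence described above can be sketched like this (the helper names are mine; the event shapes follow the payloads shown in the question, with event_id and item id omitted):

```python
import json

def make_session_update():
    # Request both output modalities. Per this thread, this alone does
    # not reliably keep the model in audio mode once text history has
    # been injected.
    return {
        "type": "session.update",
        "session": {"modalities": ["text", "audio"]},
    }

def make_history_item(role, text):
    # Assistant history uses content type "text"; user history uses
    # "input_text" (plain "text" for the user role is rejected by the
    # API, as noted above).
    content_type = "text" if role == "assistant" else "input_text"
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": role,
            "content": [{"type": content_type, "text": text}],
        },
    }

# These would be serialized and sent over the WebSocket in order:
events = [
    make_session_update(),
    make_history_item("user", "Hello!"),
    make_history_item("assistant", "Hi, how can I help?"),
]
payloads = [json.dumps(e) for e in events]
```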

I had the same problem, and the only mitigation I found is to give all the history of the conversation as if it came with the user but with something like “[Assistant]” and “[User]” in front of the messages. I think combining that with the session.update tricks you mention made it more reliable as well, but that might just be placebo.

As a side note, the other scenario where I bumped into this is trying to reduce API cost by deleting previous audio conversation items and replacing them with their text transcripts (since the cost per input text token is so much lower than the cost per audio token). That triggers the same problem where the model switches to text, but if I keep the last 2 (or ideally 3) assistant messages as audio, it nearly always keeps responding in audio, even though the older history is text.

Emphasis on nearly always, none of this seems 100% reliable, which would be a big problem in production (well, if the costs wouldn’t immediately bankrupt anyone using this in production anyway) - we really need a way to force the model to reply with audio…
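That pruning approach could be sketched as follows; the item-dict shape and KEEP_AUDIO constant are my own, and the conversation.item.delete / conversation.item.create event types are real Realtime API events:

```python
KEEP_AUDIO = 3  # keep the last N assistant audio turns as audio

def prune_events(items):
    # items: local bookkeeping of the conversation, each entry like
    #   {"id": ..., "role": ..., "transcript": ..., "is_audio": bool}
    # Returns delete + create events that swap older audio items for
    # their text transcripts, keeping the most recent assistant audio
    # turns intact (which is what mostly preserves audio replies).
    assistant_audio = [it for it in items
                       if it["is_audio"] and it["role"] == "assistant"]
    keep_ids = {it["id"] for it in assistant_audio[-KEEP_AUDIO:]}
    events = []
    for it in items:
        if it["is_audio"] and it["id"] not in keep_ids:
            events.append({"type": "conversation.item.delete",
                           "item_id": it["id"]})
            content_type = ("text" if it["role"] == "assistant"
                            else "input_text")
            events.append({"type": "conversation.item.create",
                           "item": {"type": "message",
                                    "role": it["role"],
                                    "content": [{"type": content_type,
                                                 "text": it["transcript"]}]}})
    return events
```

Note that re-created items are appended at the end of the conversation by default; to preserve ordering you would likely want to set previous_item_id on each conversation.item.create (check the API reference for the exact semantics).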


Same thing right?


I’m also experiencing this issue. In addition to the things you tried, I also tried changing the system prompt to say “ALWAYS RESPOND WITH AUDIO”; that did not help. I believe this is a bug in the API.

Any idea how to resolve this? We implemented @arund42's suggestion and it got us far, but it’s not reliable enough for anything nearing prod.


I’ve spoken with OpenAI support; however, their response was mostly unhelpful:

When using the real-time API with voice, the mode of communication (text or voice) is determined by the sequence of events and how you initiate the interaction. If you send a series of conversation.item.create events containing text-based user and assistant messages from a previous conversation, the model may assume the context is purely text-based and switch to text mode. Unfortunately, we cannot provide approximate steps for this process.

To prime the connection with a previous text conversation but ensure that the assistant remains in audio mode, you can try the following approach:

  1. Send the conversation history as a summary or metadata: Instead of replaying the entire previous conversation using conversation.item.create, you can send a summary of the relevant context or conversation history in one batch as metadata. This way, the assistant can maintain the conversation context without switching to text mode.
  2. Explicitly re-enable audio mode: After sending the conversation.item.create events, follow them with an event or command to explicitly switch back to audio mode. Depending on the API, this might involve sending a signal that the interaction should resume with audio output, such as:
  • Sending a special event to switch back to audio mode.
  • Sending an audio-based user query or some type of “voice start” event.
  3. Keep the conversation flow in audio: If possible, avoid sending too many conversation.item.create events that include both user and assistant text messages. Instead, send a brief context-setting text and then follow it with an audio-based user input, which should keep the assistant in audio mode.

Check the documentation of the real-time API to see if there’s an explicit mode-switching event that can force the assistant back into voice/audio mode after handling text-based inputs. If available, use that to control the mode explicitly.

I’ll poke them some more to see if I can get anything actually useful.
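One concrete reading of support's first suggestion is to fold the prior conversation into the session instructions instead of creating text items at all; session.update does accept an instructions field. A hedged sketch (the function name and instruction wording are my own):

```python
def session_update_with_summary(summary):
    # Puts the prior text conversation into the system-style
    # instructions rather than into conversation items, so no
    # assistant text messages enter the conversation history.
    return {
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "instructions": (
                "You are continuing an earlier conversation. "
                "Summary of that conversation:\n" + summary
            ),
        },
    }
```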

I’m currently doing session.update as per the post here: Realtime API: Did anybody managed to provide previous conversation transcript history while keeping audio answers? - #6 by hagen.rode

What are others doing?

I had this issue initially.

When sending a payload, make sure to still include BOTH modalities (text AND audio).

This will make the AI respond with audio.
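For concreteness, the two places where modalities can be specified look like this (event_ids omitted); per the advice above, both should list text and audio:

```json
{"type":"session.update","session":{"modalities":["text","audio"]}}
{"type":"response.create","response":{"modalities":["text","audio"]}}
```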

Good luck! :hugs: