Hi everyone!
I want to continue a previous text-only conversation with audio in the realtime API. To do this, I create a session with text + audio modalities in the realtime API and then inject the previous text messages using `conversation.item.create` events, where each item contains the `text` (assistant) or `input_text` (user) content of the message:
Assistant message:
{"event_id":"...","type":"conversation.item.create","item":{"id":"...","type":"message","status":"completed","role":"assistant","content":[{"type":"text","text":"..."}]}}
User message:
{"event_id":"...","type":"conversation.item.create","item":{"id":"...","type":"message","status":"completed","role":"user","content":[{"type":"input_text","text":"..."}]}}
However, this causes the realtime API to switch to text-only mode. When I then send user audio (`input_audio_buffer.append`), voice activity detection etc. works just fine, but after the user finishes speaking, the realtime API generates only text, no audio.
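For reference, a sketch of the audio events being sent. It assumes the session's input audio format is 16-bit PCM and that the raw bytes go base64-encoded in the event's `audio` field; the helper name `audio_append_event` is hypothetical, and the 24 kHz rate in the example is an assumption used only to size the dummy chunk.

```python
import base64
import json


def audio_append_event(pcm16_bytes):
    """Wrap raw audio bytes in an input_audio_buffer.append event.

    Assumes pcm16 input audio; the bytes are base64-encoded into "audio".
    """
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }


if __name__ == "__main__":
    # 10 ms of silence at an assumed 24 kHz mono rate:
    # 240 samples * 2 bytes per 16-bit sample = 480 zero bytes.
    silence = bytes(480)
    event = audio_append_event(silence)
    print(json.dumps(event)[:80])
```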
When I do not inject existing messages at the beginning, this does not happen and the API works in audio mode as expected.
So, how can I inject a text conversation into the realtime API and continue with an audio conversation?
What I’ve tried so far:
- Sending a `session.update` with audio + text modalities after sending the `conversation.item.create` events: did not fix the issue.
- Sending a `session.update` with text-only modality first, then the conversation items, then another `session.update` with text + audio: did not fix the issue.
- Using `text` instead of `input_text` for user messages: rejected by the API.
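To make the attempts easier to reproduce, here is a sketch of the event order for the two `session.update` variants. It assumes `session.update` only changes the fields it includes (here just `modalities`); the function names are hypothetical, and `item_events` stands for already-built `conversation.item.create` events like the ones above.

```python
import json


def session_update(modalities):
    """Build a session.update event that only sets the session's modalities.

    Fields not included are assumed to keep their current values.
    """
    return {"type": "session.update", "session": {"modalities": list(modalities)}}


def attempt_one_sequence(item_events):
    """First attempt: inject the history, then switch to text + audio."""
    return list(item_events) + [session_update(["text", "audio"])]


def attempt_two_sequence(item_events):
    """Second attempt: text only, then the history, then text + audio."""
    return (
        [session_update(["text"])]
        + list(item_events)
        + [session_update(["text", "audio"])]
    )


if __name__ == "__main__":
    # A single placeholder item stands in for the real conversation history.
    items = [{
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "status": "completed",
            "role": "user",
            "content": [{"type": "input_text", "text": "Hello!"}],
        },
    }]
    for event in attempt_two_sequence(items):
        print(json.dumps(event))
```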