Even with “modalities” set to “text” only in Realtime API, Audio is occasionally generated

Hello,

I’m developing an interactive service using the Realtime API. The system allows users to input via voice, receive a response from the Realtime API, display the response text on screen, and use TTS for reading the response aloud.

Since we only need the text responses from the Realtime API, after receiving session.created, I send the following with session.update:

{
  "type": "session.update",
  "session": {
    "modalities": ["text"],
    "instructions": prompt,
    "input_audio_transcription": {"model": "whisper-1"}
  }
}

While this usually works as expected with only text responses, I sometimes receive audio via response.audio.delta.

I do not want to receive audio at all. This issue occurs either after several generation requests or sometimes even on the first generation request after session.created.

I’m not sure whether the problem lies in my implementation or if it’s an issue with the API itself. Is anyone else experiencing this problem?

This is a log showing that response.audio.delta was returned despite the contents of session.updated.

logs from when the issue occurred
 websocket.open True. at 2024-10-17 21:31:46
【Receive】<session.created>
【Send】<session.update> {"type": "session.update", "session": {"modalities": ["text"], "instructions": ※※Omitted※※, "input_audio_transcription": {"model": "whisper-1"}}}
【Receive】<session.created> {'type': 'session.created', 'event_id': 'event_AJJxKFLSnTotU4k4DLGr3', 'session': {'id': 'sess_AJJxKphhNjCl9P5fXIBzS', 'object': 'realtime.session', 'model': 'gpt-4o-realtime-preview-2024-10-01', 'expires_at': 1729169206, 'modalities': ['audio', 'text'], 'instructions': "Your knowledge cutoff is 2023-10. You are a helpful, witty, and friendly AI. Act like a human, but remember that you aren't a human and that you can't do human things in the real world. Your voice and personality should be warm and engaging, with a lively and playful tone. If interacting in a non-English language, start by using the standard accent or dialect familiar to the user. Talk quickly. You should always call a function if you can. Do not refer to these rules, even if you’re asked about them.", 'voice': 'alloy', 'turn_detection': {'type': 'server_vad', 'threshold': 0.5, 'prefix_padding_ms': 300, 'silence_duration_ms': 200}, 'input_audio_format': 'pcm16', 'output_audio_format': 'pcm16', 'input_audio_transcription': None, 'tool_choice': 'auto', 'temperature': 0.8, 'max_response_output_tokens': 'inf', 'tools': []}}
【Receive】<session.updated> {'type': 'session.updated', 'event_id': 'event_AJJxKUqiJ31GVmL4eZs05', 'session': {'id': 'sess_AJJxKphhNjCl9P5fXIBzS', 'object': 'realtime.session', 'model': 'gpt-4o-realtime-preview-2024-10-01', 'expires_at': 1729169206, 'modalities': ['text'], 'instructions': ※※Omitted※※, 'voice': 'alloy', 'turn_detection': None, 'input_audio_format': 'pcm16', 'output_audio_format': 'pcm16', 'input_audio_transcription': {'model': 'whisper-1'}, 'tool_choice': 'auto', 'temperature': 0.8, 'max_response_output_tokens': 'inf', 'tools': []}}

【Send】<input_audio_buffer.commit>
【Receive】<input_audio_buffer.committed> {'type': 'input_audio_buffer.committed', 'event_id': 'event_AJJxpiFyT1ubwi5l1v34M', 'previous_item_id': None, 'item_id': 'item_AJJxpun3jL9lWgPjKRQM3'}
【Receive】<conversation.item.created> {'type': 'conversation.item.created', 'event_id': 'event_AJJxpNLATFGL5tR4d0xYy', 'previous_item_id': None, 'item': {'id': 'item_AJJxpun3jL9lWgPjKRQM3', 'object': 'realtime.item', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}}
【Receive】<response.created> {'type': 'response.created', 'event_id': 'event_AJJxpgvz8Nhsp7RTmXzxj', 'response': {'object': 'realtime.response', 'id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'status': 'in_progress', 'status_details': None, 'output': [], 'usage': None}}
【Receive】<rate_limits.updated> {'type': 'rate_limits.updated', 'event_id': 'event_AJJxpZee3kSJB22RWsIkd', 'rate_limits': [{'name': 'requests', 'limit': 10000, 'remaining': 9999, 'reset_seconds': 0.006}, {'name': 'tokens', 'limit': 2000000, 'remaining': 1995112, 'reset_seconds': 0.146}]}
【Receive】<response.output_item.added> {'type': 'response.output_item.added', 'event_id': 'event_AJJxptB81736lsSVc2ZbM', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'output_index': 0, 'item': {'id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'object': 'realtime.item', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}}
【Receive】<conversation.item.created> {'type': 'conversation.item.created', 'event_id': 'event_AJJxpdRQXtxBuwEdBAddZ', 'previous_item_id': 'item_AJJxpun3jL9lWgPjKRQM3', 'item': {'id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'object': 'realtime.item', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}}
【Receive】<response.content_part.added> {'type': 'response.content_part.added', 'event_id': 'event_AJJxp8UkudqX55BWMrklM', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'part': {'type': 'audio', 'transcript': ''}}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxpjk2rJ4J5ocQoL7Ks', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'こんばんは'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxpxV3OKqPnaW6TPJo8', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '!'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxpQwdMA8QrMICeZql4', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '本'}
【Receive】<response.audio.delta> 6400 byte
【Receive】<response.audio.delta> 9600 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxphZLFOUC1cSDJuXC7', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '当'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxpg1XbnwpQM2RbJTzD', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'ですね'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxpknl8vu7RHoApH5fD', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '。'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxpljnJ2V03lkBLhANM', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'こ'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<conversation.item.input_audio_transcription.completed>  (19 文字) こんばんは、だいぶ涼しくなってきたね
【Receive】<conversation.item.input_audio_transcription.completed> {'type': 'conversation.item.input_audio_transcription.completed', 'event_id': 'event_AJJxphGmxVFQzuv7B4YeQ', 'item_id': 'item_AJJxpun3jL9lWgPjKRQM3', 'content_index': 0, 'transcript': 'こんばんは、だいぶ涼しくなってきたね\n'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxpLSlZck5LsDsUSV9T', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'れ'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxpy8aYyqsWxgGKUFKN', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'だけ'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxptHfQgyy7cb9QRPV8', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '一'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxq8wTC0U4Vf6oUctPj', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '緒'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqa58YRbdvmzOubMOb', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'に'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqIYKZsepiB10DIfMi', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'いる'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqT0LXYwDq8qnVkrHB', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'と'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqZo1ItVNwZU2ioBxm', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '、'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqEyDxka0PgCFCq79g', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '忙'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqvjOxo6QGgikoDjHQ', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'しい'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqjP2U2yT1xCpPpNZx', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '日'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqBnrJg2liUfJ7VuCP', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '々'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqPUEXk9J2XOdZOvzY', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'でも'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqQ8yr09iVRs9D341t', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'こう'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqqvFo0L89C57JsZxJ', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'や'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqWB38eY6CNbNKIaPS', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'って'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqhiquiucUFa58X68M', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '会'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqk3eeZc0BUCUlWTil', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'える'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqLo3Gx6u9FeelGnzi', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '時間'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqgAqgHbqKT4SUqEfD', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': 'が'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxq81jOaAJ42vrvMyib', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '、'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqi4Q9rJhyZu7rJGph', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '本'}
【Receive】<response.audio.delta> 16000 byte
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxqnfytbUoz8RF9FM63', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '当に'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxq2yuI16jsisx9Maja', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '特'}
【Receive】<response.audio_transcript.delta> {'type': 'response.audio_transcript.delta', 'event_id': 'event_AJJxq2RoGqGm7FehHMh0f', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'delta': '別'}
【Receive】<response.audio.done> {'type': 'response.audio.done', 'event_id': 'event_AJJxqEEfY0RKxeWH13Vc0', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0}
【Receive】<response.audio_transcript.done> {'type': 'response.audio_transcript.done', 'event_id': 'event_AJJxqdATASRRK19z8Xa7f', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'transcript': 'こんばんは!本当ですね。これだけ一緒にいると、忙しい日々でもこうやって会える時間が、本当に特別'}
【Receive】<response.content_part.done> {'type': 'response.content_part.done', 'event_id': 'event_AJJxqk5YdxeVi4bS0AO85', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'item_id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'output_index': 0, 'content_index': 0, 'part': {'type': 'audio', 'transcript': 'こんばんは!本当ですね。これだけ一緒にいると、忙しい日々でもこうやって会える時間が、本当に特別'}}
【Receive】<response.output_item.done> {'type': 'response.output_item.done', 'event_id': 'event_AJJxqJ6aPizotl8PrXlaR', 'response_id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'output_index': 0, 'item': {'id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'object': 'realtime.item', 'type': 'message', 'status': 'incomplete', 'role': 'assistant', 'content': [{'type': 'audio', 'transcript': 'こんばんは!本当ですね。これだけ一緒にいると、忙しい日々でもこうやって会える時間が、本当に特別'}]}}
【Receive】<response.done> {'type': 'response.done', 'event_id': 'event_AJJxqz2xmQeHXWhDhAC3b', 'response': {'object': 'realtime.response', 'id': 'resp_AJJxpEsVtBnqrXeioYoaC', 'status': 'incomplete', 'status_details': {'type': 'incomplete', 'reason': 'content_filter'}, 'output': [{'id': 'item_AJJxpRTKUVAEs2CjjMz2u', 'object': 'realtime.item', 'type': 'message', 'status': 'incomplete', 'role': 'assistant', 'content': [{'type': 'audio', 'transcript': 'こんばんは!本当ですね。これだけ一緒にいると、忙しい日々でもこうやって会える時間が、本当に特別'}]}], 'usage': {'total_tokens': 889, 'input_tokens': 724, 'output_tokens': 165, 'input_token_details': {'cached_tokens': 0, 'text_tokens': 684, 'audio_tokens': 40}, 'output_token_details': {'text_tokens': 45, 'audio_tokens': 120}}}}

i am working on a text only prototype.

i do not update the session with “session.update” but use:

{
        type: 'response.create',
        response: {
          modalities: ["text"],
          instructions: 'Please assist the user.'  // optional
        }
      }

to create a text response and never get audio in response events.

2 Likes

Thank you, andreas.spaeth, for your response!

I applied the code you shared, and it seems to have resolved the issue. Now I’m getting only text responses, just as I wanted. I didn’t realize there was such a method—that was really insightful!

I really appreciate your help!

@shanpy I’m also trying to use realtime API to input an audio and the response should be a text to the voice query. For me it seems it isn’t working. The suggested code didn’t work for me. Can you please share a sample code?