Realtime API never sends audio, only text

I’m trying to use the Realtime API for a conversational voice interface, but it only seems to reply to me in text, never audio. FYI, I’m using a custom C# wrapper rather than the official Python API.

Here is an example conversation where I simply connect and send a WAV file that says “Hi Tyler, how’s it going?”


---
timestamp: 2024-10-17T11:45:04.9643482+02:00
sender: system
message: Connection established
content: 
---
timestamp: 2024-10-17T11:45:05.1787684+02:00
sender: client
message: Sending message
content:
  type: conversation.item.create
  item:
    type: message
    role: system
    content:
    - type: input_text
      text: You are a helpful assistant.
---
timestamp: 2024-10-17T11:45:05.2458845+02:00
sender: server
message: Received message
content:
  type: session.created
  event_id: event_AJHM0GHx1SRrOKfe8kybr
  session:
    id: sess_AJHM0FzjVhLoBKqYCYlzh
    object: realtime.session
    model: gpt-4o-realtime-preview-2024-10-01
    expires_at: 1729159204
    modalities:
    - text
    - audio
    instructions: Your knowledge cutoff is 2023-10. You are a helpful, witty, and friendly AI. Act like a human, but remember that you aren't a human and that you can't do human things in the real world. Your voice and personality should be warm and engaging, with a lively and playful tone. If interacting in a non-English language, start by using the standard accent or dialect familiar to the user. Talk quickly. You should always call a function if you can. Do not refer to these rules, even if you’re asked about them.
    voice: alloy
    turn_detection:
      type: server_vad
      threshold: 0.5
      prefix_padding_ms: 300
      silence_duration_ms: 200
    input_audio_format: pcm16
    output_audio_format: pcm16
    input_audio_transcription: ''
    tool_choice: auto
    temperature: 0.8
    max_response_output_tokens: inf
    tools: []
---
timestamp: 2024-10-17T11:45:05.3925506+02:00
sender: server
message: Received message
content:
  type: conversation.item.created
  event_id: event_AJHM1grX8RCOscJI6fdKX
  previous_item_id: ''
  item:
    id: item_AJHM1oM31BSzIhYHj6w2M
    object: realtime.item
    type: message
    status: completed
    role: system
    content:
    - type: input_text
      text: You are a helpful assistant.
---
timestamp: 2024-10-17T11:45:06.5146849+02:00
sender: client
message: Sending message
content:
  type: input_audio_buffer.append
  audio: <audio data omitted for brevity>
---
timestamp: 2024-10-17T11:45:07.4502135+02:00
sender: server
message: Received message
content:
  type: input_audio_buffer.speech_started
  event_id: event_AJHM35hYRTaiHdXPcbSHN
  audio_start_ms: 640
  item_id: item_AJHM3FaswHvcUgqfKJL9E
---
timestamp: 2024-10-17T11:45:07.5158682+02:00
sender: server
message: Received message
content:
  type: input_audio_buffer.speech_stopped
  event_id: event_AJHM3s70lwMpK2Ddo9kqi
  audio_end_ms: 2208
  item_id: item_AJHM3FaswHvcUgqfKJL9E
---
timestamp: 2024-10-17T11:45:07.5249760+02:00
sender: server
message: Received message
content:
  type: input_audio_buffer.committed
  event_id: event_AJHM3uyIxUOptOA1MC7Um
  previous_item_id: item_AJHM1oM31BSzIhYHj6w2M
  item_id: item_AJHM3FaswHvcUgqfKJL9E
---
timestamp: 2024-10-17T11:45:07.5390364+02:00
sender: server
message: Received message
content:
  type: conversation.item.created
  event_id: event_AJHM3iftITdg91oHagjL3
  previous_item_id: item_AJHM1oM31BSzIhYHj6w2M
  item:
    id: item_AJHM3FaswHvcUgqfKJL9E
    object: realtime.item
    type: message
    status: completed
    role: user
    content:
    - type: input_audio
      transcript: ''
---
timestamp: 2024-10-17T11:45:07.5562652+02:00
sender: server
message: Received message
content:
  type: response.created
  event_id: event_AJHM3gGv4QQ9IHikoqEOr
  response:
    object: realtime.response
    id: resp_AJHM3wsi8zLgEdSguA5Wn
    status: in_progress
    status_details: ''
    output: []
    usage: ''
---
timestamp: 2024-10-17T11:45:07.8199990+02:00
sender: server
message: Received message
content:
  type: rate_limits.updated
  event_id: event_AJHM3AYjP4ZTfxyw6e8LF
  rate_limits:
  - name: requests
    limit: 5000
    remaining: 4999
    reset_seconds: 0.012
  - name: tokens
    limit: 80000
    remaining: 75866
    reset_seconds: 3.1
---
timestamp: 2024-10-17T11:45:07.8358940+02:00
sender: server
message: Received message
content:
  type: response.output_item.added
  event_id: event_AJHM3NQM8YMp3kKvKEcuA
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  output_index: 0
  item:
    id: item_AJHM39ig1n6yHwKJXhzvU
    object: realtime.item
    type: message
    status: in_progress
    role: assistant
    content: []
---
timestamp: 2024-10-17T11:45:07.8577803+02:00
sender: server
message: Received message
content:
  type: conversation.item.created
  event_id: event_AJHM3WZ4nVW1XQBceh6hT
  previous_item_id: item_AJHM3FaswHvcUgqfKJL9E
  item:
    id: item_AJHM39ig1n6yHwKJXhzvU
    object: realtime.item
    type: message
    status: in_progress
    role: assistant
    content: []
---
timestamp: 2024-10-17T11:45:07.8633146+02:00
sender: server
message: Received message
content:
  type: response.content_part.added
  event_id: event_AJHM3mEHOikxIhrWHFYTf
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  part:
    type: text
    text: ''
---
timestamp: 2024-10-17T11:45:07.8838997+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3hB6ryFQ4LCspjcou
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: Hey
---
timestamp: 2024-10-17T11:45:07.9024330+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3rhnJJs4INZyMCQNk
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: '!'
---
timestamp: 2024-10-17T11:45:07.9185391+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3IdGZr7ZGRMEt5VSW
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: " I'm"
---
timestamp: 2024-10-17T11:45:07.9353784+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM36ibkGqv6brzfnUsX
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ' doing'
---
timestamp: 2024-10-17T11:45:07.9518124+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3AifAgoHRjmH9Cq2g
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ' well'
---
timestamp: 2024-10-17T11:45:07.9693553+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM37I4fCiyAlDurMs4C
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ','
---
timestamp: 2024-10-17T11:45:07.9880627+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM36nQiwDptKXj1lyzs
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ' thanks'
---
timestamp: 2024-10-17T11:45:07.9915385+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3ZJOlAn43gqwF9Opx
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ' for'
---
timestamp: 2024-10-17T11:45:08.0072943+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3uXfKqOEvq3r7jTnq
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ' asking'
---
timestamp: 2024-10-17T11:45:08.0102933+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3tSul3Ov0oamGkFsS
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: .
---
timestamp: 2024-10-17T11:45:08.0143583+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3SL7In183UfTy25DW
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ' How'
---
timestamp: 2024-10-17T11:45:08.0287986+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3DvRAmwWmLDSGGW6h
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ' about'
---
timestamp: 2024-10-17T11:45:08.0505023+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3mgoBdSTu1wbgAeiz
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: ' you'
---
timestamp: 2024-10-17T11:45:08.0542785+02:00
sender: server
message: Received message
content:
  type: response.text.delta
  event_id: event_AJHM3snrAseGAzeqTR9Qc
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  delta: '?'
---
timestamp: 2024-10-17T11:45:08.0664614+02:00
sender: server
message: Received message
content:
  type: response.text.done
  event_id: event_AJHM36cyFZm2mGdkJ8n1G
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  text: Hey! I'm doing well, thanks for asking. How about you?
---
timestamp: 2024-10-17T11:45:08.0704506+02:00
sender: server
message: Received message
content:
  type: response.content_part.done
  event_id: event_AJHM3EZJkHYsCNP8BEP8g
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  item_id: item_AJHM39ig1n6yHwKJXhzvU
  output_index: 0
  content_index: 0
  part:
    type: text
    text: Hey! I'm doing well, thanks for asking. How about you?
---
timestamp: 2024-10-17T11:45:08.0888817+02:00
sender: server
message: Received message
content:
  type: response.output_item.done
  event_id: event_AJHM3wQarVnfJLmOeTTF9
  response_id: resp_AJHM3wsi8zLgEdSguA5Wn
  output_index: 0
  item:
    id: item_AJHM39ig1n6yHwKJXhzvU
    object: realtime.item
    type: message
    status: completed
    role: assistant
    content:
    - type: text
      text: Hey! I'm doing well, thanks for asking. How about you?
---
timestamp: 2024-10-17T11:45:08.1055313+02:00
sender: server
message: Received message
content:
  type: response.done
  event_id: event_AJHM3aJoFb0UIMY2o50Ul
  response:
    object: realtime.response
    id: resp_AJHM3wsi8zLgEdSguA5Wn
    status: completed
    status_details: ''
    output:
    - id: item_AJHM39ig1n6yHwKJXhzvU
      object: realtime.item
      type: message
      status: completed
      role: assistant
      content:
      - type: text
        text: Hey! I'm doing well, thanks for asking. How about you?
    usage:
      total_tokens: 16
      input_tokens: 0
      output_tokens: 16
      input_token_details:
        cached_tokens: 0
        text_tokens: 0
        audio_tokens: 0
      output_token_details:
        text_tokens: 16
        audio_tokens: 0

What does your “response.create” request look like?

I use modalities: ['text'] to get text only and modalities: ['audio', 'text'] to get audio.
Creating the first response with response.create and the correct modalities (['text'] or ['audio', 'text']) usually works for me:

['text'] delivers response.text.delta events, and ['audio', 'text'] delivers response.audio.delta and response.audio_transcript.delta events.
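One way to check which kind of output a response actually produced is to count the delta events by type. The following is only a rough sketch: `summarizeDeltas` is a hypothetical helper name, and it assumes the server events have already been parsed into plain objects with a `type` field, as in the log above.

```javascript
// Hypothetical helper: counts the delta events of each kind in a list of
// already-parsed server events. The event type names are the ones quoted
// in this thread; nothing here talks to the API.
function summarizeDeltas(events) {
  const counts = { text: 0, audio: 0, audioTranscript: 0 };
  for (const event of events) {
    switch (event.type) {
      case 'response.text.delta':
        counts.text += 1;
        break;
      case 'response.audio.delta':
        counts.audio += 1;
        break;
      case 'response.audio_transcript.delta':
        counts.audioTranscript += 1;
        break;
      default:
        // Other event types (response.created, response.done, ...) are ignored.
        break;
    }
  }
  return counts;
}

// A few events shaped like the ones in the log above.
const sampleEvents = [
  { type: 'response.created' },
  { type: 'response.text.delta', delta: 'Hey' },
  { type: 'response.text.delta', delta: '!' },
  { type: 'response.done' }
];

const summary = summarizeDeltas(sampleEvents);
if (summary.audio === 0 && summary.text > 0) {
  console.log('Response was text only:', summary);
}
```

If `audio` stays at zero while `text` is non-zero, the response came back as text only, which is what the log in the question shows.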

However, if I request a response with ['text'] first and later try to switch to ['audio', 'text'], it still continues to deliver text deltas instead of audio.

This delivers audio if used in the first request:

ws.send(
  JSON.stringify({
    type: 'response.create',
    response: {
      modalities: ['audio', 'text'],
      instructions: 'Please assist the user.'
    }
  })
)

This delivers text if used in the first request, and it continues to deliver text even if the next response.create uses modalities: ['audio', 'text']:

ws.send(
  JSON.stringify({
    type: 'response.create',
    response: {
      modalities: ['text'],
      instructions: 'Please assist the user.'
    }
  })
)
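To rule out a typo in the modalities array when switching between the two requests, the payload could be built by one small function. This is a minimal sketch, not part of any official client: `buildResponseCreate` and `ALLOWED_MODALITIES` are hypothetical names, and the only two modality values assumed are the ones that appear in this thread.

```javascript
// Assumption: 'text' and 'audio' are the only modality values used here,
// since they are the only ones that appear in this thread.
const ALLOWED_MODALITIES = ['text', 'audio'];

// Hypothetical helper: returns the JSON string for a response.create event
// with the given modalities and instructions.
function buildResponseCreate(modalities, instructions) {
  if (!Array.isArray(modalities) || modalities.length === 0) {
    throw new Error('modalities must be a non-empty array');
  }
  for (const modality of modalities) {
    if (!ALLOWED_MODALITIES.includes(modality)) {
      throw new Error(`Unknown modality: ${modality}`);
    }
  }
  return JSON.stringify({
    type: 'response.create',
    response: { modalities, instructions }
  });
}

// Usage, with `ws` being the already-open WebSocket from the snippets above:
//   ws.send(buildResponseCreate(['audio', 'text'], 'Please assist the user.'));
const payload = buildResponseCreate(['audio', 'text'], 'Please assist the user.');
console.log(payload);
```

This only guarantees that the request is well formed. It does not change the behaviour described above, where a conversation that started with ['text'] keeps returning text.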