Bug: response.create returns audio only when response.input contains a single item

I’m seeing inconsistent behavior from the response.create event when requesting audio output. If I include exactly one item in response.input, the response reliably contains audio as well as text. However, as soon as I provide more than one item in response.input, the response contains only text (no audio), even though modalities still requests both.

Model: gpt-4o-realtime-preview-2024-12-17
Endpoint: v1/realtime/sessions

This call returns both audio & text:

{
  type: "response.create",
  response: {
    modalities: ["audio", "text"],
    output_audio_format: "pcm16",
    input: [
      { type: "message", role: "user", content: [{ type: "input_text", text: "Tell me a joke" }] }
    ]
  }
}

This call returns only text (no audio):

{
  type: "response.create",
  response: {
    modalities: ["audio", "text"],
    output_audio_format: "pcm16",
    input: [
      { type: "message", role: "assistant", content: [{ type: "text", text: "Hi I am your assistant, ask whatever." }] },
      { type: "message", role: "user", content: [{ type: "input_text", text: "Tell me a joke" }] }
    ]
  }
}
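
For anyone trying to reproduce this, here is a rough sketch of how the two cases could be compared side by side. The helper names buildResponseCreate and summarizeOutput are made up for illustration and are not part of any SDK. The sketch also assumes that audio arrives as response.audio.delta server events and text as response.text.delta events, which should be checked against the current Realtime API reference.

```javascript
// Hypothetical helper: wraps plain { role, text } pairs into a response.create
// event with the same modalities and audio format as the two payloads above.
function buildResponseCreate(messages) {
  return {
    type: "response.create",
    response: {
      modalities: ["audio", "text"],
      output_audio_format: "pcm16",
      input: messages.map(({ role, text }) => ({
        type: "message",
        role,
        // Assistant items use "text" content and user items use "input_text",
        // mirroring the payloads in this report.
        content: [{ type: role === "assistant" ? "text" : "input_text", text }],
      })),
    },
  };
}

// Hypothetical helper: given the server events collected for one response,
// reports which kinds of output showed up. The event type names are an
// assumption about the server's naming and may differ between API versions.
function summarizeOutput(serverEvents) {
  const types = new Set(serverEvents.map((event) => event.type));
  return {
    audio: types.has("response.audio.delta"),
    transcript: types.has("response.audio_transcript.delta"),
    text: types.has("response.text.delta"),
  };
}
```

The idea is to send JSON.stringify(buildResponseCreate([...])) once with one message and once with two, collect every server event until response.done, and pass each collection to summarizeOutput. If the behavior described here holds, audio should be true for the single-item case and false for the two-item case.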