I’m seeing inconsistent behavior from the response.create endpoint when requesting audio output. If I include exactly one item in response.input, the endpoint reliably returns an audio response. However, as soon as I provide more than one item in response.input, the endpoint only returns a text transcript (no audio).
Model: gpt-4o-realtime-preview-2024-12-17
Endpoint: v1/realtime/sessions
This call returns both audio & text:
{
  type: "response.create",
  response: {
    modalities: ["audio", "text"],
    output_audio_format: "pcm16",
    input: [
      {type: "message", role: "user", content: [{type: "input_text", text: "Tell me a joke"}]}
    ]
  }
};
This call returns only text (no audio):
{
  type: "response.create",
  response: {
    modalities: ["audio", "text"],
    output_audio_format: "pcm16",
    input: [
      {type: "message", role: "assistant", content: [{type: "text", text: "Hi I am your assistant, ask whatever."}]},
      {type: "message", role: "user", content: [{type: "input_text", text: "Tell me a joke"}]}
    ]
  }
};
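For anyone who wants to reproduce this, here is a minimal sketch of how the two events above can be built from the same code path, so the only difference between them is the number of items in response.input. The helper name buildResponseCreate and the variable names are made up for illustration and are not part of any SDK; sending is left as a comment because it assumes an already-open Realtime WebSocket (called ws here), which is not shown.

```javascript
// Hypothetical helper (not an SDK function): wraps a list of input items
// in a response.create event using the same fields as the two calls above.
function buildResponseCreate(inputItems) {
  return {
    type: "response.create",
    response: {
      modalities: ["audio", "text"],
      output_audio_format: "pcm16",
      input: inputItems,
    },
  };
}

// The user turn that appears in both calls.
const userTurn = {
  type: "message",
  role: "user",
  content: [{ type: "input_text", text: "Tell me a joke" }],
};

// The extra assistant turn that only appears in the failing call.
const assistantTurn = {
  type: "message",
  role: "assistant",
  content: [{ type: "text", text: "Hi I am your assistant, ask whatever." }],
};

// One input item: this is the shape that returns audio and text for me.
const singleItemEvent = buildResponseCreate([userTurn]);

// Two input items: this is the shape that returns only text for me.
const multiItemEvent = buildResponseCreate([assistantTurn, userTurn]);

// Assumption: `ws` is an open WebSocket connected to the Realtime API.
// ws.send(JSON.stringify(singleItemEvent));
// ws.send(JSON.stringify(multiItemEvent));
```

Building both events through one function rules out accidental differences in modalities or output_audio_format between the two calls.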