const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "mp3" },
  messages,
});
I am requesting audio output, but I am receiving only text output.
The call above is exactly how I request audio output from the model.
I do receive audio output when the input messages are simple, but not when they get slightly more complex.
Example of complex input messages:
const messages = [
  {
    role: "assistant",
    name: "interviewer",
    content: "Hi",
  },
  {
    role: "user",
    name: "student",
    content: "Hi",
  },
  {
    role: "user",
    name: "admin",
    content: "Ask the student a short question.",
  },
];
The text-only response that I am receiving:
{
  id: "chatcmpl-xxx",
  object: "chat.completion",
  created: 1730903406,
  model: "gpt-4o-audio-preview-2024-10-01",
  choices: [
    {
      index: 0,
      message: {
        role: "assistant",
        content: "What subject are you currently focused on in your studies?",
        refusal: null,
      },
      finish_reason: "stop",
    },
  ],
  usage: {
    prompt_tokens: 31,
    completion_tokens: 11,
    total_tokens: 42,
    prompt_tokens_details: {
      cached_tokens: 0,
      audio_tokens: 0,
      text_tokens: 31,
      image_tokens: 0,
    },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      text_tokens: 11,
      accepted_prediction_tokens: 0,
      rejected_prediction_tokens: 0,
    },
  },
  system_fingerprint: "fp_xxx",
}
I’m not 100% sure what’s causing this issue, but neither the “name” field nor two consecutive “user” messages is the definitive cause: both cases have produced a valid audio response when the input messages were simpler.
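
For anyone trying to reproduce or work around this, here is a minimal sketch of detecting the text-only case and retrying the request. It assumes that a successful response carries the audio at choices[0].message.audio, with base64 data and a transcript. The names getAudio, createWithAudioRetry, and maxAttempts are hypothetical, and retrying is only a possible workaround, not a fix for whatever causes the fallback to text.

```javascript
// Returns the audio object from a chat completion, or null when the
// model answered with text only (no `message.audio` in the response).
function getAudio(completion) {
  const audio = completion?.choices?.[0]?.message?.audio;
  const hasData = audio && typeof audio.data === "string" && audio.data.length > 0;
  return hasData ? audio : null;
}

// Hypothetical helper: calls `create` (for example a thin wrapper around
// openai.chat.completions.create) until a response contains audio, at most
// `maxAttempts` times. Returns the last response either way, so the caller
// can still fall back to the text content.
async function createWithAudioRetry(create, params, maxAttempts = 3) {
  let completion;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    completion = await create(params);
    if (getAudio(completion)) break;
  }
  return completion;
}
```

It could be called as createWithAudioRetry((p) => openai.chat.completions.create(p), { model, modalities, audio, messages }), after which getAudio(response) either returns the audio object or null if every attempt came back as text.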