gpt-4o-audio-preview responds in text, not audio

const response = await openai.chat.completions.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],            // request both text and audio output
    audio: { voice: "alloy", format: "mp3" }, // spoken reply as base64-encoded MP3
    messages,
});

I am trying to request audio output, but I am receiving text output instead.

The above is exactly how I am requesting audio output from the model. I am able to receive audio output when the input messages are simple, but not when they get a bit more complex.
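For what it's worth, a minimal check along these lines (assuming the documented response shape, where audio replies arrive under message.audio with base64 data plus a transcript) makes the two cases easy to tell apart:

import fs from "node:fs";

const msg = response.choices[0].message;
if (msg.audio) {
    // Audio reply: msg.audio.data is the base64-encoded MP3,
    // and msg.audio.transcript holds the text of what was said.
    fs.writeFileSync("reply.mp3", Buffer.from(msg.audio.data, "base64"));
} else {
    // Text-only reply, like the one shown further below.
    console.log(msg.content);
}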

An example of complex input messages:

const messages = [
    {
        role: "assistant",
        name: "interviewer",
        content: "Hi",
    },
    {
        role: "user",
        name: "student",
        content: "Hi",
    },
    {
        role: "user",
        name: "admin",
        content: "Ask the student a short question.",
    },
]

The text-only response that I am receiving:

{
    id: "chatcmpl-xxx",
    object: "chat.completion",
    created: 1730903406,
    model: "gpt-4o-audio-preview-2024-10-01",
    choices: [
        {
            index: 0,
            message: {
                role: "assistant",
                content: "What subject are you currently focused on in your studies?",
                refusal: null,
            },
            finish_reason: "stop",
        },
    ],
    usage: {
        prompt_tokens: 31,
        completion_tokens: 11,
        total_tokens: 42,
        prompt_tokens_details: {
            cached_tokens: 0,
            audio_tokens: 0,
            text_tokens: 31,
            image_tokens: 0,
        },
        completion_tokens_details: {
            reasoning_tokens: 0,
            audio_tokens: 0,
            text_tokens: 11,
            accepted_prediction_tokens: 0,
            rejected_prediction_tokens: 0,
        },
    },
    system_fingerprint: "fp_xxx",
}

I’m not 100% sure what’s causing this issue. Including the “name” field isn’t the definitive cause, nor are the two consecutive “user” messages, as I’ve had both produce a valid audio response when the input messages are simpler.

The problem is that the model simply will not maintain a voice conversation if you conduct the conversation history as a text transcript.

First, giving the assistant a name is a bit pointless: you cannot attach a matching name to the final assistant prompt that OpenAI uses, where the AI writes its response.

In that list of messages, you are immediately showing the model that the assistant responds with text, a pattern it will almost never deviate from once it has been seen.

If you send a single system message (with lots of “voice enabled” language) and a single user input in text, you can usually get one spoken output.
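Something along these lines usually gets one spoken reply (a minimal sketch; the exact system prompt wording is just an illustration):

const response = await openai.chat.completions.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "mp3" },
    messages: [
        {
            role: "system",
            content:
                "You are a voice-enabled interviewer. You speak aloud; every reply you give is spoken audio.",
        },
        {
            role: "user",
            content: "Ask the student a short question.",
        },
    ],
});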

To maintain a voice conversation, you are stuck sending back the IDs of the assistant audio it generated before (and those IDs expire), and you would be better off speaking to it in audio as well.
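A rough sketch of that multi-turn pattern, assuming the first call actually returned audio so that message.audio.id exists:

const first = await openai.chat.completions.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "mp3" },
    messages: [{ role: "user", content: "Ask the student a short question." }],
});

// On the next turn, pass the assistant's previous turn back as an audio ID
// rather than as a text transcript. Note the ID expires (see audio.expires_at).
const followUp = await openai.chat.completions.create({
    model: "gpt-4o-audio-preview",
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "mp3" },
    messages: [
        { role: "user", content: "Ask the student a short question." },
        { role: "assistant", audio: { id: first.choices[0].message.audio.id } },
        { role: "user", content: "I'm studying linear algebra." },
    ],
});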


You make some good points, thank you for the advice.

I’ll explore a bit more and see how I can get it to generate audio with as few input tokens as possible.
