Waiting for gpt-4o-audio-preview

As per the recent announcement:

Audio in the Chat Completions API will be released in the coming weeks, as a new model gpt-4o-audio-preview. With gpt-4o-audio-preview, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.

I am eagerly awaiting this new gpt-4o modality, as I am focusing on building a mobile UX that combines chat with audio, and I find the Realtime API a bit of a nightmare to set up.

Has there been any concrete news or timelines regarding gpt-4o-audio-preview? Do we know how it will stream both text and audio chunks at the same time back to the client? I hope it’s not websockets! :slight_smile:


WebSockets are the best way to make the two-way exchange of audio efficient… In the non-WebSocket case they’ll likely use Server-Sent Events (SSE) like they do today; that’s essentially a half-duplex WebSocket. And if you don’t want streaming at all, I’m sure they’ll offer returning the fully encoded response as a file.

You’ll take a big latency hit waiting for the audio to fully generate, though.

Sorry, I should have clarified.

We use a serverless architecture with AWS Lambda for the backend, so we cannot run long-lived processes, which makes WebSockets impractical. Also, looking at our users’ behaviour, 95% use text-to-speech and only 5% use voice transcription, so what users really want is simultaneous audio/text output (which can be done over SSE). The UX we want to create is a push-to-talk, WhatsApp-style experience, as sketched below.
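To illustrate the shape of what we’re after, here is a minimal SSE relay sketch. It assumes a FastAPI server purely for illustration (not our actual Lambda setup) and streams plain text deltas, since the audio model isn’t out yet:

# Minimal SSE relay sketch (illustrative, not our Lambda code):
# forwards text deltas from a streamed chat completion as SSE events.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/chat")
def chat(prompt: str):
    def events():
        stream = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                # One SSE event per text delta.
                yield f"data: {chunk.choices[0].delta.content}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(events(), media_type="text/event-stream")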


import { writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate an audio response to the given prompt
const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: "Is a golden retriever a good family dog?"
    }
  ]
});

// Inspect returned data
console.log(response.choices[0]);

// Write the base64-decoded audio data to a file
// (no encoding option: we are writing a binary Buffer, not text)
writeFileSync(
  "dog.wav",
  Buffer.from(response.choices[0].message.audio.data, "base64")
);

Audio inputs are now available on Chat Completions, @Dobo, using the model gpt-4o-audio-preview, which supports function calling well.
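For instance, here is a rough sketch of pairing a tool definition with the audio modality (the get_weather tool and its schema below are made up for illustration):

# Sketch: function calling with gpt-4o-audio-preview.
# The get_weather tool below is a made-up example, not a real API.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)

choice = response.choices[0]
if choice.message.tool_calls:
    # The model asked us to call the tool; arguments arrive as a JSON string.
    print(choice.message.tool_calls[0].function.arguments)
else:
    # Otherwise we get text plus audio (choice.message.audio.data, base64).
    print(choice.message.audio.transcript)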


Exciting! Thanks for sharing.

I just read the audio guide here:
https://platform.openai.com/docs/guides/audio/faq?lang=javascript&audio-generation-quickstart-example=audio-out

The examples don’t use streaming, so I wonder: does this new model support streaming?

I would love to be able to stream both chunks of text and chunks of audio back to the client via SSE.

Regards

I wonder: does this new model support streaming?

Yes, it does!

1 Like

Does this mean that you can stream the audio output? Or do you still have to wait for the audio file to be completed? How does this work?
Is there an example somewhere of playing the streamed audio while it’s coming in? I haven’t played around with this yet.

I can confirm that the Chat Completions endpoint seems to support streaming text and audio modalities at the same time.

The only supported audio format when streaming is pcm16.

Example:

{
    "model": "gpt-4o-audio-preview",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "modalities": ["text", "audio"],
    "audio": {
        "voice": "alloy",
        "format": "pcm16"
    },
    "stream_options": {
        "include_usage": true
    },
    "stream": true
  }

Response:

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"role":"assistant","refusal":null},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"content":null,"audio":{"id":"audio_67153952b364819093d6a4aac6e0767a","transcript":"Hi"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" there"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":"!"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" How"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" can"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" I"}},"finish_reason":null}],"usage":null}

...

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"role":"assistant","content":null,"refusal":null,"audio":{"id":"audio_67153952b364819093d6a4aac6e0767a","data":"CgAEAAEACQABAAcABQAGAAgABwAKAAIAAgACAAQABwAFAAQAA..."}},"finish_reason":null}],"usage":null}

...

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"data":"OCPRIZkeWRsBGB8V..."}},"finish_reason":null}],"usage":null}

...

data: [DONE]

It looks like the response alternates between chunks of audio transcript and chunks of audio data, with the actual text content set to null.
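For anyone wiring this up, here is a minimal sketch of consuming the stream in Python and accumulating both delta types. It reads the same audio.transcript and audio.data fields shown in the raw chunks above, using model_dump() so the audio delta is a plain dict regardless of SDK typing. Treat it as a sketch, not a definitive client.

# Sketch: accumulate the transcript and base64 audio from the stream above.
import base64
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "pcm16"},
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

transcript = []
pcm = bytearray()

for chunk in stream:
    data = chunk.model_dump()
    if not data.get("choices"):
        continue  # e.g. the final usage-only chunk
    audio = (data["choices"][0].get("delta") or {}).get("audio") or {}
    if audio.get("transcript"):
        transcript.append(audio["transcript"])
    if audio.get("data"):
        pcm.extend(base64.b64decode(audio["data"]))

print("".join(transcript))
print(f"{len(pcm)} bytes of pcm16 audio")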

I am not sure that the response includes the full token usage:

{
    "id": "chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T",
    "object": "chat.completion.chunk",
    "created": 1729444178,
    "model": "gpt-4o-audio-preview-2024-10-01",
    "system_fingerprint": "fp_4eafc16e9d",
    "choices": [],
    "usage": {
        "prompt_tokens": 19,
        "completion_tokens": 50,
        "total_tokens": 69,
        "prompt_tokens_details": {
            "cached_tokens": 0
        },
        "completion_tokens_details": {
            "reasoning_tokens": 0
        }
    }
}

I would assume that text tokens are different from audio tokens, but the usage metrics only show text tokens.


When the gpt-4o-audio-preview model is called with streaming, the generated output is voice data in pcm16 format. How do I save and play this audio data?

completion = client.chat.completions.create(
    model=model,
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "pcm16"},
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
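One approach for saving (a sketch, continuing from the completion stream above and assuming the pcm16 output is 24 kHz, 16-bit, mono, which is the format the docs describe): buffer the base64-decoded deltas, then wrap them in a WAV header with the standard-library wave module.

# Sketch: collect the streamed pcm16 deltas and save them as a playable WAV.
# Assumes 24 kHz, 16-bit, mono output, per the docs' description of pcm16.
import base64
import wave

pcm = bytearray()
for chunk in completion:
    data = chunk.model_dump()
    if not data.get("choices"):
        continue
    audio = (data["choices"][0].get("delta") or {}).get("audio") or {}
    if audio.get("data"):
        pcm.extend(base64.b64decode(audio["data"]))

with wave.open("reply.wav", "wb") as f:
    f.setnchannels(1)      # mono
    f.setsampwidth(2)      # 16-bit samples
    f.setframerate(24000)  # 24 kHz
    f.writeframes(bytes(pcm))

To play the audio while it is still arriving, you would feed the decoded chunks straight to an audio output stream (e.g. with a library like sounddevice or pyaudio) instead of buffering to a file first.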
