Waiting for gpt-4o-audio-preview

Dobo · October 7, 2024, 10:02pm

As per the recent announcements:

Audio in the Chat Completions API will be released in the coming weeks, as a new model gpt-4o-audio-preview. With gpt-4o-audio-preview, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.

I am eagerly awaiting this new gpt-4o modality as I am focusing on building a mobile UX that combines chat with audio, and I find the realtime API a bit of a nightmare to set up.

Has there been any concrete news or timelines regarding gpt-4o-audio-preview? Do we know how it will stream both text and audio chunks at the same time back to the client? I hope it’s not websockets!

stevenic · October 7, 2024, 10:26pm

Websockets are the best way to make the two-way exchange of audio efficient… In the non-websocket case they’ll likely use Server-Sent Events (SSE) like they do today. That’s essentially half-duplex websockets. And if you don’t want streaming at all, I’m sure they’ll offer returning the fully encoded response as a file.

You’ll take a big latency hit waiting for the audio to fully generate though.

Dobo · October 7, 2024, 10:48pm

Sorry I should have clarified.

We use a serverless architecture with AWS Lambda for the backend, so we cannot run long-lived processes, hence WS is not practical. Also, looking at our users’ behaviour, 95% use text-to-speech and only 5% use voice transcription, so what users really want is simultaneous audio and text output (which can be done over SSE). The UX we want to create is kind of a push-to-talk WhatsApp-style experience.

JoseRFJunior · October 18, 2024, 5:46am

import { writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate an audio response to the given prompt
const response = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: "Is a golden retriever a good family dog?"
    }
  ]
});

// Inspect returned data
console.log(response.choices[0]);

// Write audio data to a file
writeFileSync(
  "dog.wav",
  Buffer.from(response.choices[0].message.audio.data, "base64"),
  { encoding: "utf-8" }
);

sps · October 18, 2024, 5:55am

Audio inputs are now available on chat completions @Dobo, using the model gpt-4o-audio-preview - which supports function calling well.

Dobo · October 18, 2024, 8:51am

Exciting! Thanks for sharing.

I just read the audio guide here:
https://platform.openai.com/docs/guides/audio/faq?lang=javascript&audio-generation-quickstart-example=audio-out

The examples are not using streaming so I wonder does this new model support streaming?

I would love to be able to stream both chunks of text and chunks of audio back to the client via SSE.

Regards

jpvx · October 19, 2024, 6:03am

I wonder does this new model support streaming?

Yes, it does!

Medy · October 19, 2024, 1:02pm

Does this mean that you can stream the audio output? Or do you still have to wait for the audio file to be completed? How does this work?
Is there an example somewhere for playing the streamed audio while it’s coming in? Haven’t played around with this yet.

Dobo · October 20, 2024, 5:15pm

I can confirm that the chat completions endpoint seems to support streaming text and audio modalities at the same time.

When streaming, the only supported audio format is pcm16.

Example:

{
    "model": "gpt-4o-audio-preview",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "modalities": ["text", "audio"],
    "audio": {
        "voice": "alloy",
        "format": "pcm16"
    },
    "stream_options": {
        "include_usage": true
    },
    "stream": true
  }

Response:

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"role":"assistant","refusal":null},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"content":null,"audio":{"id":"audio_67153952b364819093d6a4aac6e0767a","transcript":"Hi"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" there"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":"!"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" How"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" can"}},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"transcript":" I"}},"finish_reason":null}],"usage":null}

...

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"role":"assistant","content":null,"refusal":null,"audio":{"id":"audio_67153952b364819093d6a4aac6e0767a","data":"CgAEAAEACQABAAcABQAGAAgABwAKAAIAAgACAAQABwAFAAQAA..."}},"finish_reason":null}],"usage":null}

...

data: {"id":"chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T","object":"chat.completion.chunk","created":1729444178,"model":"gpt-4o-audio-preview-2024-10-01","system_fingerprint":"fp_4eafc16e9d","choices":[{"index":0,"delta":{"audio":{"data":"OCPRIZkeWRsBGB8V..."}},"finish_reason":null}],"usage":null}

...

data: [DONE]

It looks like the response alternates between chunks of audio transcript and chunks of audio data, with the actual text content set to null.
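For anyone who wants to consume a stream shaped like the one above, here is a minimal Python sketch that walks the `data:` lines and collects the transcript pieces and the audio bytes separately. The function name `collect_audio_stream` and the sample lines are made up for illustration, and the sketch assumes each `audio.data` value is an independently decodable base64 string, which the thread does not confirm.

```python
import base64
import json


def collect_audio_stream(sse_lines):
    """Accumulate transcript text and raw audio bytes from SSE lines."""
    transcript_parts = []
    pcm = bytearray()
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # The usage chunk has an empty "choices" list, so this loop skips it.
        for choice in chunk.get("choices", []):
            audio = (choice.get("delta") or {}).get("audio") or {}
            if "transcript" in audio:
                transcript_parts.append(audio["transcript"])
            if "data" in audio:
                # Assumption: every chunk is valid base64 on its own.
                pcm.extend(base64.b64decode(audio["data"]))
    return "".join(transcript_parts), bytes(pcm)


# Hypothetical, shortened lines in the same shape as the response above.
sample = [
    'data: {"choices":[{"index":0,"delta":{"audio":{"transcript":"Hi"}}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"audio":{"transcript":" there"}}}]}',
    'data: {"choices":[{"index":0,"delta":{"audio":{"data":"AAABAA=="}}}]}',
    'data: {"choices":[],"usage":{"total_tokens":69}}',
    "data: [DONE]",
]
text, audio = collect_audio_stream(sample)
print(text)        # Hi there
print(len(audio))  # 4
```

Keeping the transcript and the audio in separate accumulators means a client can show the text as it arrives while buffering the audio for playback.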

Dobo · October 20, 2024, 5:18pm

I am not sure that the response includes the full token usage:

{
    "id": "chatcmpl-AKTisMkErFxmr5c3wz1p88sWVxa0T",
    "object": "chat.completion.chunk",
    "created": 1729444178,
    "model": "gpt-4o-audio-preview-2024-10-01",
    "system_fingerprint": "fp_4eafc16e9d",
    "choices": [],
    "usage": {
        "prompt_tokens": 19,
        "completion_tokens": 50,
        "total_tokens": 69,
        "prompt_tokens_details": {
            "cached_tokens": 0
        },
        "completion_tokens_details": {
            "reasoning_tokens": 0
        }
    }
}

I would assume that text tokens are different than audio tokens, but the usage metrics only show text tokens.

hyhenryraymond · November 4, 2024, 9:07am

After calling the gpt-4o-audio-preview model with streaming, the generated data is voice data in the pcm16 format. How can I save and play this audio data?

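from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
model = "gpt-4o-audio-preview"
prompt = "Hello!"  # placeholder prompt

completion = client.chat.completions.create(
    model=model,
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "pcm16"},
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)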
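One possible way to save the audio, sketched under assumptions: once the base64 `data` chunks from the stream have been decoded and concatenated into raw bytes, they can be wrapped in a WAV container using Python's standard `wave` module, and the resulting file opens in any ordinary audio player. The helper name `save_pcm16_as_wav` is hypothetical, and the 24 kHz mono default is an assumption that is not stated in this thread, so check it against the current API documentation.

```python
import os
import tempfile
import wave


def save_pcm16_as_wav(pcm_bytes, path, sample_rate=24000, channels=1):
    """Wrap raw 16-bit little-endian PCM samples in a WAV container."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)  # assumed rate, see note above
        wav_file.writeframes(pcm_bytes)


# Demo with one second of silence standing in for decoded stream audio.
demo_path = os.path.join(tempfile.gettempdir(), "reply_demo.wav")
save_pcm16_as_wav(b"\x00\x00" * 24000, demo_path)

with wave.open(demo_path, "rb") as check:
    print(check.getframerate(), check.getnframes())  # 24000 24000
```

If the playback sounds too fast or too slow, the assumed sample rate is the first thing to check.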

sps · November 4, 2024, 10:09pm

A post was split to a new topic: How to replace my GPT TTS call for better performance?
