Speech-to-Speech (Audio Input/Output) with 4o

In the Realtime API announcement, it mentioned that

“We’re also introducing audio input and output in the Chat Completions API to support use cases that don’t require the low-latency benefits of the Realtime API. With this update, developers can pass any text or audio inputs into GPT-4o and have the model respond with their choice of text, audio, or both.”

Does this mean I can directly use audio files as my user prompt without having to transcribe it? If so, how can I do this? I was looking at the docs for Chat Completions and it does not seem to be updated on this topic.

3 Likes

looking at the API references page for chat completions, it will likely to be included into the message property:

A list of messages comprising the conversation so far. Depending on the model you use, different message types (modalities) are supported, like text, images, and audio.

the audio part link is not working yet so probably it will be updated soon

1 Like

I also wondered where this feature is, after they announced it in the Realtime API Announcement. There also don’t seem to be any changes to the sdk libraries, that would indicate a big audio update in any direction, yet.

1 Like

any news about this yet ? I read the announcement and was looking to send audio via the completions api

1 Like

I don’t think so, I’m expecting a “tts-2” model to release eventually, but who knows

I mean its already mentioned here : https://platform.openai.com/docs/api-reference/chat/create

but i guess they still working on it or something. maybe it will release when the beta is done ?
Anyway excited about this .