The Realtime API announcement mentioned that:
“We’re also introducing audio input and output in the Chat Completions API to support use cases that don’t require the low-latency benefits of the Realtime API. With this update, developers can pass any text or audio inputs into GPT-4o and have the model respond with their choice of text, audio, or both.”
Does this mean I can use audio files directly as my user prompt, without having to transcribe them first? If so, how do I do this? I was looking at the docs for Chat Completions and they don’t seem to be updated on this topic yet.
Looking at the API reference page for Chat Completions, it will likely be included in the `messages` property:
A list of messages comprising the conversation so far. Depending on the model you use, different message types (modalities) are supported, like text, images, and audio.
The link for the audio content part is not working yet, so the docs will probably be updated soon.
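Until the docs land, here’s a guess at what the request shape might look like, by analogy with how image inputs work as content parts today. Everything here is an assumption: the model name (`gpt-4o-audio-preview`), the `modalities` and `audio` parameters, and the `input_audio` content-part type are inferred from the announcement, not confirmed by the reference.

```python
import base64

def build_audio_request(audio_bytes: bytes, fmt: str = "wav") -> dict:
    """Build a hypothetical Chat Completions request body with an audio
    content part. All audio-specific field names are guesses pending
    the official docs update."""
    encoded = base64.b64encode(audio_bytes).decode("utf-8")
    return {
        # Assumed audio-capable model name
        "model": "gpt-4o-audio-preview",
        # Assumed: ask for text and/or audio back
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": fmt},
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is said in this recording?"},
                    # Assumed content-part type, mirroring "image_url" parts
                    {
                        "type": "input_audio",
                        "input_audio": {"data": encoded, "format": fmt},
                    },
                ],
            }
        ],
    }

# Build the payload locally (no API call); you would pass these fields to
# client.chat.completions.create(...) once the SDK supports them.
payload = build_audio_request(b"\x00\x01fake-audio-bytes")
print(payload["modalities"])
```

If this follows the image-input pattern, the audio would be sent base64-encoded inline rather than as a URL, which would also explain why the content-part docs need a dedicated section for it.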
I also wondered where this feature is after they announced it in the Realtime API post. There also don’t seem to be any changes to the SDK libraries yet that would indicate a big audio update in either direction.