What will the GPT-4o audio API look like?

GPT-4o audio capabilities are not out yet, but I’m wondering if any information can be shared on what the API will look like?

One thing seems clear: it should be audio in, audio out. No more TTS or separate voice synthesis! This is really exciting already.

However, I’m wondering whether interruption detection will be something we need to implement ourselves, or whether it will be part of the API in some way. If it is part of the API, we’d have to continuously stream audio to the server for it to pick up on any interruptions. Does that mean we’ll need to keep an HTTP connection open for the whole conversation? I’m not super familiar with the practical limitations of HTTP connections, but this feels better suited to something like WebSockets.

Curious to hear any info from the OpenAI team, or any interesting speculation!

Good questions, especially about streaming and interruption… I was thinking about that during the demo yesterday: the ability to “cut off” the LLM mid-output…

Following for answers…

It seems almost guaranteed that it will require a persistent connection between the client, you, and OpenAI, with your code acting as a transformation step in the pipeline.

Most likely it will use the same concepts as the Streams API.

Before OpenAI, we would accomplish this “interruption” by attaching a VAD (voice activity detector) to the incoming audio.

https://speechprocessingbook.aalto.fi/Recognition/Voice_activity_detection.html
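For anyone who hasn’t built one, here is a minimal sketch of the simplest kind of VAD: an energy threshold over short audio frames. It assumes 16-bit little-endian mono PCM, and the function names, the threshold of 500, and the three-frame rule are all illustrative guesses rather than anything taken from a real product. Production VADs are usually model-based and far more robust to noise.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of one frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Treat a frame as speech when its energy exceeds a fixed (assumed) threshold."""
    return frame_rms(frame) > threshold


def detect_interruption(frames, threshold: float = 500.0, min_frames: int = 3) -> bool:
    """Return True once `min_frames` consecutive frames look like speech.

    Requiring a short run of loud frames avoids triggering on a single click or pop.
    """
    run = 0
    for frame in frames:
        run = run + 1 if is_speech(frame, threshold) else 0
        if run >= min_frames:
            return True
    return False
```

In a pipeline like the one described above, `detect_interruption` would run on microphone frames while the model’s audio is playing, and a `True` result would be the signal to stop playback.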

Knowing OpenAI and their “KISS” philosophy, I would wager that they will handle all the heavy lifting. So really all that’s necessary is to maintain the two connections and pass information back and forth.

OR they will say “screw you, figure out interruptions yourself” LOL, and then use an SSE-like protocol: you send your data, and they stream the response back to the client.
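For reference, the Server-Sent Events wire format itself is simple: `field: value` lines, with a blank line ending each event. Below is a rough sketch of a parser for that framing. The `audio` event name and the payloads in the usage note are made up for illustration, and nothing here is based on an announced OpenAI audio format.

```python
def parse_sse(stream: str):
    """Yield (event, data) pairs from text in the Server-Sent Events format."""
    event, data = "message", []  # "message" is the default event type
    for line in stream.splitlines():
        if line == "":
            # A blank line dispatches the buffered event, if it has any data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            # Lines starting with a colon are comments, often used as keep-alives.
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

Feeding it `"event: audio\ndata: chunk-1\n\n"` yields one `("audio", "chunk-1")` pair. Note that SSE only flows from server to client, so the microphone audio would still need a separate upload path.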

Hmm, having two connections would be possible: one for streaming in and one for streaming out. But then wouldn’t the client have to host an HTTP server to accept the incoming connection, and then tell the API server how to connect to it? This feels like exactly the problem WebSockets are designed to solve.

Yeah, I would imagine it would be accomplished using WebSockets.
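To make the full-duplex idea concrete, here is a rough sketch of “barge-in” handling on a single bidirectional channel. Two `asyncio` queues stand in for the two directions of a WebSocket, and the string `"speech"` stands in for a frame that a VAD has flagged. Every name here is hypothetical, and none of it reflects an actual OpenAI API.

```python
import asyncio


async def speak(chunks, outbound: asyncio.Queue) -> None:
    """Stream response chunks to the client one at a time."""
    for chunk in chunks:
        await outbound.put(chunk)
        await asyncio.sleep(0.01)  # stands in for real-time playback pacing


async def listen_for_speech(inbound: asyncio.Queue) -> None:
    """Return as soon as an inbound frame is flagged as speech."""
    while True:
        frame = await inbound.get()
        if frame == "speech":  # assumption: a real server would run a VAD here
            return


async def respond_with_barge_in(chunks, inbound, outbound) -> bool:
    """Stream `chunks` out, stopping early if the user starts talking.

    Returns True when the response was interrupted.
    """
    speaker = asyncio.create_task(speak(chunks, outbound))
    listener = asyncio.create_task(listen_for_speech(inbound))
    done, pending = await asyncio.wait(
        {speaker, listener}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop whichever side is still running
    await asyncio.gather(*pending, return_exceptions=True)
    return listener in done


async def demo(interrupt_after=None):
    """Run one response; optionally simulate the user speaking after a delay."""
    inbound, outbound = asyncio.Queue(), asyncio.Queue()

    async def user():
        if interrupt_after is not None:
            await asyncio.sleep(interrupt_after)
            await inbound.put("speech")

    user_task = asyncio.create_task(user())
    chunks = [f"chunk-{i}" for i in range(10)]
    interrupted = await respond_with_barge_in(chunks, inbound, outbound)
    await user_task
    sent = []
    while not outbound.empty():
        sent.append(outbound.get_nowait())
    return interrupted, sent
```

Calling `asyncio.run(demo(0.03))` cuts the response off after a few chunks, while `asyncio.run(demo())` lets all ten through. The point is that both directions stay live at the same time, which is awkward with plain request/response HTTP and natural with a WebSocket.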

Interesting. Will it be part of the Assistants API? Or will the Assistants API only support voice files without streaming, like images?

The same question applies to video streaming.