What will the GPT-4o audio API look like?

nimobeeren · May 15, 2024, 5:30pm

GPT-4o audio capabilities are not out yet, but I’m wondering if any information can be shared on what the API will look like?

One thing seems clear: it should be audio in, audio out. No more TTS or separate voice synthesis! This is really exciting already.

However, I’m wondering whether the interruption detection will be something that we need to implement ourselves, or will be part of the API in some way. If part of the API, we’d have to continuously stream audio to the server for it to pick up on any interruptions. Does that mean we’ll need to keep an HTTP connection open for the whole conversation? I’m not super familiar with the practical limitations of HTTP connections, but this feels better suited to something like WebSockets.

Curious for any info from OpenAI team or interesting speculations!

PaulBellow · May 15, 2024, 5:58pm

Good questions, especially the streaming and interruption… I was thinking about that during the demo yesterday - the ability to “cut off” the LLM mid-output…

Following for answers…

RonaldGRuckus · May 15, 2024, 6:07pm

It seems almost guaranteed that it will require some constant connection between the client, you, and OpenAI, with your code acting as a transformation step in the pipeline.

Most likely using the same framework concepts as the Streams API

Before OpenAI we would accomplish this “interruption” by attaching a VAD (Voice Activity Detector).

https://speechprocessingbook.aalto.fi/Recognition/Voice_activity_detection.html
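For anyone who hasn't used one, here is a minimal sketch of the simplest kind of VAD: an energy threshold over short frames. It is not what any production system (or OpenAI) uses. The function names, the 160-sample frame size, the 0.02 threshold, and the three-frame minimum are all assumptions picked for illustration, and real detectors are usually model-based and far more robust to noise.

```python
import math

def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_speech(samples, frame_size=160, threshold=0.02, min_frames=3):
    """Return True once min_frames consecutive frames exceed the threshold.

    Requiring several loud frames in a row keeps a single click or pop
    from being treated as the user starting to talk.
    """
    run = 0
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        if frame_rms(samples[start:start + frame_size]) > threshold:
            run += 1
            if run >= min_frames:
                return True
        else:
            run = 0
    return False

# Illustrative inputs: 0.1 s of audio at an assumed 16 kHz sample rate.
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(1600)]
click = [0.5] * 160 + [0.0] * 1440
```

In an interruption setup, the client would run something like `detect_speech` on microphone frames while the assistant's audio is playing, and stop playback as soon as it returns True.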

Knowing OpenAI and their “KISS” philosophy I would wager that they will handle all the heavy lifting. So really all that’s necessary is to maintain the two connections, passing information back and forth.

OR they will say “screw you, figure out interruptions yourself” LOL and then will maintain an SSE-like protocol of sending data, and then running the stream back to the client.

nimobeeren · May 15, 2024, 6:40pm

Hmm, having two connections would be possible, one for streaming in and one for streaming out. But then wouldn’t the client have to host an HTTP server to accept the incoming connection, and then tell the API server how to connect to it? This feels like the problem WebSockets were designed to solve.
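To make the full-duplex idea concrete, here is a rough sketch of how an interruption could travel over a single bidirectional channel. Nothing here reflects a real OpenAI API: the two `asyncio.Queue` objects stand in for the two directions of one WebSocket-like connection, and `fake_server`, `client`, `run_conversation`, and the `"interrupt"` / `"stopped"` / `"done"` messages are all hypothetical names.

```python
import asyncio

async def fake_server(inbox, outbox):
    """Stand-in for a remote model: streams reply chunks until interrupted."""
    for i in range(100):
        # Between chunks, check whether the client has sent an interruption.
        if not inbox.empty() and inbox.get_nowait() == "interrupt":
            await outbox.put("stopped")
            return
        await outbox.put(f"chunk-{i}")
        await asyncio.sleep(0.01)  # simulate time spent generating audio
    await outbox.put("done")

async def client(inbox, outbox, interrupt_after):
    """Collect reply chunks and send an interrupt after a given number."""
    received = []
    while True:
        msg = await inbox.get()
        received.append(msg)
        if msg in ("stopped", "done"):
            return received
        if len(received) == interrupt_after:
            # In a real client this would be triggered by detected speech.
            await outbox.put("interrupt")

async def run_conversation(interrupt_after=3):
    # One queue per direction models a single full-duplex connection.
    to_server, to_client = asyncio.Queue(), asyncio.Queue()
    server_task = asyncio.create_task(fake_server(to_server, to_client))
    received = await client(to_client, to_server, interrupt_after)
    await server_task
    return received

result = asyncio.run(run_conversation(interrupt_after=3))
```

The point of the sketch is that both sides can send at any time over the same channel, so the client never needs to accept an incoming connection of its own.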

RonaldGRuckus · May 15, 2024, 6:55pm

Yeah, I would imagine it would be accomplished using WebSockets.

maxpain · May 16, 2024, 5:11pm

Interesting. Will it be part of the Assistants API? Or will the Assistants API only support voice files without streaming, like images?

The same question about video streaming.

deliberatekids · June 11, 2024, 6:15pm

Either embed this inside the GPT-4o API itself or let developers do it themselves. Either way, they need some kind of real-time streaming capability, such as WebRTC or Agora RTE.

Trust me, WebSockets or HTTP are not a good choice for large-scale use cases.
