GPT-4o text to speech and speech to text

brandonminiwheats · May 13, 2024, 6:24pm

It seems in the documentation that gpt-4o currently only takes text and image input. Are there any plans to allow it to take in audio data and return generated audio like we see in the demo videos? How would this be implemented, would they have to make websocket connections? What might the pricing look like?

float · May 13, 2024, 6:42pm

+1

I’m also interested in the roadmap for this functionality. Will the ability to stream audio to/from the model, as demonstrated in the demos, be available in the via API?

allenjlawson · May 13, 2024, 7:44pm

+100 me too I would love this. I want to stream audio as output please.

owencmoore · May 13, 2024, 8:33pm

Audio support is coming in the future, but not available today.

jsissler · May 13, 2024, 9:27pm

Currently using Azure AI Speech API for speech/text interfacing to chat model. The Microsoft API supports streaming on-demand and continuous recognition. Will GPT-4o audio support still be file-based or will it be able to replace the Microsoft API?

Thiago · May 13, 2024, 11:24pm

I don’t think this information is public yet and it seems like OpenAI will announce this in the coming weeks.

martinrobson · May 16, 2024, 1:09am

Are we saying it will replace Whisper-1?

_j · May 16, 2024, 1:14am

Whisper uses some clever technique to combine 30 second context windows into transcribing several hours.

This model on API would instead have some new context loading technique for voice for the purpose of chatting. 125k of input context means $0.60 per whatever amount of listening can be received, vs about $0.40 an hour on whisper.

And whisper doesn’t say, “I’m sorry, but I cannot complete that request”.

So unlikely to be a whisper replacement except for what has been demonstrated.

martinrobson · May 16, 2024, 5:54pm

The voice transcription in chat on the GPT iphone app is so much better than Whisper. From what I’ve experienced with 100’s of hours of raw WAV transcription.

ferluisxd · May 22, 2024, 7:01am

I agree.
What is more is that I don’t believe is GPT-4o related, because as a free user, I ended up using gpt 3.5 and I didn’t notic the difference in the speech recognition/speech to text quality (maybe these two models are running locally?)
I would like to know if we will be able to use the ASR models as a replacement for whisper v3, considering it is much better.
And hopefully the STT as well…

martinrobson · May 31, 2024, 4:07pm

Not too much ASR but what I did was transcribe with whisper and use OpenAI to clean it and and try to identify different people based on conversational responses. It did what I needed it to do.

Feature Request - voice transcription service that improves by using Generative AI to process the audio realtime and provide contextual summaries of recorded conversations… movies… books. You could identify people in the audio and have it cleaned up, tagged and even reconstructed via SSML to recreate the conversation with Google Voice Actors.

The best are interpretive answers to fun questions. “In the meeting… who sounded the most hungry?”

brandonminiwheats · June 13, 2024, 11:59am

Any update on this? Is it in beta yet, will we get it before the end of the year?

alexbu92 · September 22, 2024, 12:26pm

Does anybody know of any updates on this topic? I’m afraid they have shelved this capability…

brandon15 · September 23, 2024, 12:23am

Nothing has been released publicly about it, originally it read that we would be getting it this fall, not sure if that is true or not.

There are rumors that it will be coming out this Tuesday 9/24, but take that with a grain of salt, it doesn’t come from a reliable source.

_j · September 23, 2024, 12:47am

“By the end of fall”, well, welcome to the first day of fall.

brandon15 · September 24, 2024, 6:39pm

Seems the leak was correct, it will be rolling out to plus users over the course of the week!

brandon15 · September 24, 2024, 6:39pm

Welcome to the first day of advanced voice mode!

dark1 · September 25, 2024, 6:48am

But I’m sure lots of devs are interested in the api rollout… Does anyone know when that will happen??

fidika · September 25, 2024, 7:00am

Any updates on when advanced voice-mode will be available via API, or what it will look like?

brandonminiwheats · September 30, 2024, 2:10pm

Hopefully they will give us an answer at dev day tomorrow

Topic		Replies	Views
GPT-4o New Voice Model, API Release API	21	22361	July 23, 2024
Will the API for the New Voice Be Released Separately? API	4	3043	September 3, 2024
Advanced Voice Mode for API API	22	20139	October 5, 2024
True multimodal gpt4-omni from OpenAI's May Release, when and what? Community gpt-4 , chatgpt , assistants-api	4	697	July 30, 2024
When will audio to audio be released for gpt-4o please? API gpt-4o	8	4773	July 2, 2024

GPT-4o text to speech and speech to text

Related topics