It seems from the documentation that gpt-4o currently only takes text and image input. Are there any plans to allow it to take in audio data and return generated audio, like we see in the demo videos? How would this be implemented? Would clients need to open websocket connections? What might the pricing look like?
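Purely speculating, but if it does end up websocket-based, I'd imagine the client side looking something like this. To be clear, the endpoint, auth scheme, and message shapes below are all made up; nothing like this is documented yet:

```python
# Speculative sketch only -- the endpoint and message format are guesses,
# not a real OpenAI API.
import base64
import json
import os

import websockets  # pip install websockets


async def stream_audio(pcm_chunks):
    # Hypothetical realtime endpoint; invented for illustration.
    url = "wss://api.openai.com/v1/audio/stream"
    # extra_headers is the legacy websockets argument name
    # (newer releases call it additional_headers).
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    output = bytearray()
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Send microphone audio up as base64-encoded PCM frames.
        for chunk in pcm_chunks:
            await ws.send(json.dumps({
                "type": "audio.input",
                "data": base64.b64encode(chunk).decode(),
            }))
        await ws.send(json.dumps({"type": "audio.input.done"}))
        # Read generated audio frames back as they arrive.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "audio.output":
                output.extend(base64.b64decode(event["data"]))
            elif event.get("type") == "response.done":
                break
    return bytes(output)
```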
+1
I’m also interested in the roadmap for this functionality. Will the ability to stream audio to/from the model, as demonstrated in the demos, be available via the API?
+100, me too, I would love this. I want to stream audio as output, please.
Audio support is coming in the future, but not available today.
Currently using the Azure AI Speech API for speech/text interfacing with a chat model. The Microsoft API supports streaming for both on-demand and continuous recognition. Will GPT-4o audio support still be file-based, or will it be able to replace the Microsoft API?
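For reference, the continuous-recognition flow with the Microsoft SDK looks roughly like this (from memory, so double-check against the Azure docs):

```python
# Continuous recognition with the Azure Speech SDK.
import time

import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY", region="YOUR_REGION"
)
# With no audio config, the SDK defaults to the system microphone.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

# Partial hypotheses stream in as you talk; finals arrive on 'recognized'.
recognizer.recognizing.connect(lambda evt: print("partial:", evt.result.text))
recognizer.recognized.connect(lambda evt: print("final:  ", evt.result.text))

recognizer.start_continuous_recognition()
time.sleep(30)  # keep listening for 30 seconds
recognizer.stop_continuous_recognition()
```

This is the streaming behavior I'm hoping GPT-4o audio can match or replace.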
I don’t think this information is public yet and it seems like OpenAI will announce this in the coming weeks.
Are we saying it will replace Whisper-1?
Whisper uses a clever technique of chaining 30-second context windows together so it can transcribe several hours of audio.
This model on the API would instead need some new context-loading technique for voice, built for the purpose of chatting. 125k of input context works out to about $0.60 for however much listening that buys, vs. about $0.40 an hour on Whisper.
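Back-of-envelope with today's published prices ($5/1M input tokens for gpt-4o, $0.006/min for whisper-1); how many tokens an hour of audio would actually consume is anyone's guess:

```python
# Back-of-envelope comparison using currently published prices.
gpt4o_input_per_mtok = 5.00   # USD per 1M input tokens (gpt-4o)
whisper_per_minute   = 0.006  # USD per minute (whisper-1)

full_context_tokens = 125_000
print(f"Filling the gpt-4o context once: ${full_context_tokens * gpt4o_input_per_mtok / 1e6:.2f}")
# -> Filling the gpt-4o context once: $0.62

print(f"Whisper, one hour of audio:      ${whisper_per_minute * 60:.2f}")
# -> Whisper, one hour of audio:      $0.36
```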
And Whisper doesn’t say, “I’m sorry, but I cannot complete that request.”
So it's unlikely to be a Whisper replacement beyond what has been demonstrated.
The voice transcription in chat on the ChatGPT iPhone app is so much better than Whisper, at least from what I’ve experienced with hundreds of hours of raw WAV transcription.
I agree.
What’s more, I don’t believe this is GPT-4o related, because as a free user I ended up using GPT-3.5 and I didn’t notice any difference in the speech recognition/speech-to-text quality (maybe the speech models are running locally?).
I would like to know if we will be able to use these ASR models as a replacement for Whisper v3, considering they are much better.
And hopefully the TTS as well…
Not too much ASR, but what I did was transcribe with Whisper and use an OpenAI chat model to clean it up and try to identify the different people based on their conversational responses. It did what I needed it to do.
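For anyone curious, the gist of the pipeline; the prompt wording and model choice are just what worked for me:

```python
# Transcribe with whisper-1, then use a chat model to clean up the text
# and guess at speaker turns.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "Clean up this raw transcript: fix punctuation, remove filler "
            "words, and label speaker turns (Speaker 1, Speaker 2, ...) "
            "based on the conversational back-and-forth."
        )},
        {"role": "user", "content": transcript.text},
    ],
)
print(response.choices[0].message.content)
```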
Feature request: a voice transcription service that improves on this by using generative AI to process the audio in real time and provide contextual summaries of recorded conversations… movies… books. You could identify the people in the audio and have the transcript cleaned up, tagged, and even reconstructed via SSML to recreate the conversation with Google voice actors.
The best are interpretive answers to fun questions. “In the meeting… who sounded the most hungry?”
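To sketch the SSML reconstruction step: once you have a cleaned, speaker-tagged transcript, it could be as simple as mapping each speaker to a voice. The voice names below are placeholders, not real catalog entries; substitute real voices from your TTS provider:

```python
# Sketch of the SSML reconstruction step. Voice names are placeholders.
from xml.sax.saxutils import escape

VOICES = {"Speaker 1": "en-US-VoiceA", "Speaker 2": "en-US-VoiceB"}

def to_ssml(turns):
    """turns: list of (speaker, text) pairs from the tagged transcript."""
    parts = ["<speak>"]
    for speaker, text in turns:
        parts.append(
            f'<voice name="{VOICES[speaker]}">{escape(text)}</voice>'
        )
    parts.append("</speak>")
    return "\n".join(parts)

print(to_ssml([("Speaker 1", "Who sounded the most hungry?"),
               ("Speaker 2", "Definitely me.")]))
```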
Any update on this? Is it in beta yet? Will we get it before the end of the year?
Does anybody know of any updates on this topic? I’m afraid they have shelved this capability…
Nothing has been released publicly about it. Originally it read like we would be getting it this fall, but I'm not sure if that still holds.
There are rumors that it will be coming out this Tuesday, 9/24, but take that with a grain of salt; it doesn’t come from a reliable source.
“By the end of fall”, well, welcome to the first day of fall.
Seems the leak was correct: it will be rolling out to Plus users over the course of the week!
Welcome to the first day of advanced voice mode!
But I’m sure lots of devs are interested in the API rollout… Does anyone know when that will happen?
Any updates on when advanced voice mode will be available via the API, or what it will look like?
Hopefully they will give us an answer at DevDay tomorrow.