GPT-4o text-to-speech and speech-to-text

It seems from the documentation that gpt-4o currently only takes text and image input. Are there any plans to allow it to take in audio data and return generated audio, like we see in the demo videos? How would this be implemented? Would clients have to make websocket connections? What might the pricing look like?
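Purely speculative, but if streaming audio did arrive over websockets, a client might look something like the sketch below. The endpoint URL and JSON message shape are invented for illustration; only the third-party `websockets` package and the chunking logic are real.

```python
import base64
import json


def chunk_pcm(audio: bytes, chunk_size: int = 3200) -> list[bytes]:
    """Split raw PCM audio into fixed-size chunks suitable for streaming."""
    return [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]


async def stream_audio(audio: bytes) -> None:
    """Send audio chunks over a (hypothetical) websocket endpoint."""
    import websockets  # third-party package: pip install websockets

    # NOTE: this URL and the message format are made up for illustration.
    async with websockets.connect("wss://api.example.com/v1/audio/stream") as ws:
        for chunk in chunk_pcm(audio):
            await ws.send(json.dumps({"audio": base64.b64encode(chunk).decode()}))
        await ws.send(json.dumps({"event": "done"}))
```

You would drive it with `asyncio.run(stream_audio(pcm_bytes))`; whatever the real protocol turns out to be, some chunk-and-send loop like this is the usual shape.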

10 Likes

+1

I’m also interested in the roadmap for this functionality. Will the ability to stream audio to/from the model, as demonstrated in the demos, be available via the API?

7 Likes

+100, me too. I would love this. I want to stream audio as output, please.

6 Likes

Audio support is coming in the future, but not available today.

6 Likes

Currently using the Azure AI Speech API for speech/text interfacing to the chat model. The Microsoft API supports streaming with both on-demand and continuous recognition. Will GPT-4o audio support still be file-based, or will it be able to replace the Microsoft API?

I don’t think this information is public yet and it seems like OpenAI will announce this in the coming weeks.

2 Likes

Are we saying it will replace Whisper-1?

Whisper uses a clever technique of stitching together 30-second context windows, which lets it transcribe several hours of audio.

This model on the API would instead have some new context-loading technique for voice, built for the purpose of chatting. 128k of input context means roughly $0.60 per whatever amount of listening fills it, vs. about $0.40 an hour on Whisper.
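As a back-of-envelope check on those numbers: the whisper-1 rate of $0.006/minute is published, while the GPT-4o figure is a guess that assumes audio would be billed like text input tokens at the launch price of $5 per 1M input tokens.

```python
# whisper-1 is billed at $0.006 per minute of audio.
whisper_per_hour = 0.006 * 60  # $0.36/hour, roughly the $0.40 quoted above

# Hypothetical: if audio were billed as ordinary input tokens, filling
# a ~125k-token input context at $5 per 1M input tokens would cost:
context_tokens = 125_000
input_price_per_million = 5.00  # GPT-4o text input price at launch
full_context_cost = context_tokens / 1_000_000 * input_price_per_million

print(f"whisper-1: ${whisper_per_hour:.2f}/hour")
print(f"GPT-4o, one full input context: ${full_context_cost:.3f}")
```

So a single full context of "listening" would land around $0.60, which only beats Whisper if that context covers well over an hour and a half of audio.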

And whisper doesn’t say, “I’m sorry, but I cannot complete that request”.

So unlikely to be a whisper replacement except for what has been demonstrated.

The voice transcription in chat on the ChatGPT iPhone app is so much better than Whisper, based on my experience with hundreds of hours of raw WAV transcription.

1 Like

I agree.
What’s more, I don’t believe it is GPT-4o related, because as a free user I ended up using GPT-3.5 and I didn’t notice any difference in the speech recognition/speech-to-text quality (maybe these two models run locally?).
I would like to know if we will be able to use the ASR model as a replacement for Whisper v3, considering it is much better.
And hopefully the TTS as well…

Not so much ASR, but what I did was transcribe with Whisper and use OpenAI to clean it up and try to identify different people based on conversational responses. It did what I needed it to do.
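For anyone wanting to try the same approach, here is a rough sketch. The prompt wording, model choice, and helper names are mine; it assumes the official `openai` Python package (v1+) with `OPENAI_API_KEY` set in the environment.

```python
def cleanup_prompt(raw_transcript: str) -> str:
    """Instruction for the chat model (illustrative wording, not canonical)."""
    return (
        "Clean up this raw transcript: fix punctuation, drop filler words, "
        "and label likely speakers (Speaker 1, Speaker 2, ...) based on the "
        "conversational back-and-forth:\n\n" + raw_transcript
    )


def transcribe_and_clean(audio_path: str, chat_model: str = "gpt-4o") -> str:
    """Transcribe with whisper-1, then clean the text with a chat model."""
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    response = client.chat.completions.create(
        model=chat_model,
        messages=[{"role": "user", "content": cleanup_prompt(transcript.text)}],
    )
    return response.choices[0].message.content
```

Bear in mind the speaker labels are the chat model’s guesses from conversational cues, not true diarization from the audio itself.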

Feature request: a voice transcription service that improves by using generative AI to process the audio in real time and provide contextual summaries of recorded conversations… movies… books. You could identify people in the audio and have it cleaned up, tagged, and even reconstructed via SSML to recreate the conversation with Google voice actors.

The best are interpretive answers to fun questions. “In the meeting… who sounded the most hungry?” :rofl:

Any update on this? Is it in beta yet? Will we get it before the end of the year?

1 Like