GPT-4o text to speech and speech to text

It seems from the documentation that gpt-4o currently only takes text and image input. Are there any plans to allow it to take in audio data and return generated audio, like we see in the demo videos? How would this be implemented? Would it require websocket connections? What might the pricing look like?



I’m also interested in the roadmap for this functionality. Will the ability to stream audio to/from the model, as demonstrated in the demos, be available via the API?


+100, me too, I would love this. I want to stream audio as output, please.


Audio support is coming in the future, but not available today.


Currently using the Azure AI Speech API for speech/text interfacing to the chat model. The Microsoft API supports streaming on-demand and continuous recognition. Will GPT-4o audio support still be file-based, or will it be able to replace the Microsoft API?

I don’t think this information is public yet, and it seems like OpenAI will announce it in the coming weeks.


Are we saying it will replace Whisper-1?

Whisper uses a clever technique of stitching together 30-second context windows, which lets it transcribe recordings several hours long.
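To illustrate what that stitching looks like at a high level, here's a small sketch of splitting a long recording into overlapping 30-second spans. The window and stride values are illustrative assumptions for this sketch, not Whisper's actual internals:

```python
# Sketch: covering a long recording with Whisper-sized windows.
# Whisper's encoder sees at most 30 seconds at a time; longer audio is
# handled by sliding the window forward and carrying context across the
# seam. The 5 s overlap here is an assumed value for illustration.

WINDOW_S = 30.0   # encoder context length in seconds
STRIDE_S = 25.0   # advance per step, leaving 5 s of overlap

def chunk_spans(duration_s: float, window: float = WINDOW_S,
                stride: float = STRIDE_S) -> list[tuple[float, float]]:
    """Return (start, end) spans that cover the whole recording."""
    spans = []
    start = 0.0
    while start < duration_s:
        spans.append((start, min(start + window, duration_s)))
        if start + window >= duration_s:
            break
        start += stride
    return spans

# A 95-second clip needs four overlapping windows:
print(chunk_spans(95.0))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 80.0), (75.0, 95.0)]
```

Each chunk would then be transcribed separately, with the overlap used to reconcile words that straddle a boundary.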

This model on the API would instead need some new context-loading technique for voice, geared toward chatting. Filling 125k tokens of input context would cost roughly $0.60 per context-full of listening, versus about $0.40 an hour on Whisper.
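To make that comparison concrete, here's the arithmetic I'm using. The prices below (GPT-4o input at $5 per 1M tokens, Whisper at $0.006 per minute) are assumed list prices, not anything official:

```python
# Back-of-the-envelope comparison: filling GPT-4o's input context with
# audio tokens vs. transcribing the same audio with Whisper.
# Both prices are assumptions based on published list prices.

GPT4O_INPUT_PER_MTOK = 5.00    # USD per 1M input tokens (assumed)
WHISPER_PER_MINUTE = 0.006     # USD per minute of audio (assumed)

def cost_gpt4o_context(tokens: int) -> float:
    """Cost to fill `tokens` of GPT-4o input context."""
    return tokens / 1_000_000 * GPT4O_INPUT_PER_MTOK

def cost_whisper_hour() -> float:
    """Cost to transcribe one hour of audio with Whisper."""
    return WHISPER_PER_MINUTE * 60

print(f"125k-token GPT-4o context: ${cost_gpt4o_context(125_000):.2f}")
print(f"Whisper, one hour of audio: ${cost_whisper_hour():.2f}")
```

So a full 125k context costs about $0.62, while an hour of Whisper transcription is $0.36, which is where the rough "$0.60 vs $0.40" numbers come from.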

And Whisper doesn’t say, “I’m sorry, but I cannot complete that request.”

So it’s unlikely to be a Whisper replacement beyond what has been demonstrated.

The voice transcription in chat on the GPT iPhone app is so much better than Whisper, in my experience from hundreds of hours of raw WAV transcription.
