I know that audio with GPT-4o is not supported in the API yet, but I was wondering: will the streaming SDK for audio be similar to the one for text? And, relatedly, will audio streaming even be a thing, or will you have to wait for the full audio response before playback can start?
My current framework makes a full round trip: transcribe the audio input, generate a text response, then turn that response into audio output. I don't use streaming anywhere, since everything is built around displaying the text and audio once they're ready, as well as saving both to a local database. I'm not inclined to rewrite the code to support text streaming unless audio streaming will eventually be available and look fairly similar. I'm using the Assistants API, by the way.
Any hints about what the future SDK will look like would be helpful!