Hi Flutter Developers,
In my Flutter project, I need to build a voice chatbot.
I am using native STT, and then OpenAI TTS for the responses.
However, what I notice is that on Android devices the response time (the time until playback is first heard) is 7-8 seconds, whereas on iOS devices it is 12-15 seconds.
I am already streaming the audio responses (response_format: pcm).
Is such a high response time normal, or can we reduce the latency further for both Android and iOS?
Model and format I am using: tts-1, pcm.
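
For anyone reading along, here is a rough sketch of the buffering arithmetic behind "time until playback is first heard". It assumes the `pcm` output is raw 24 kHz, 16-bit, mono audio (worth checking against the current OpenAI docs) and that playback starts once a small prebuffer has filled instead of after the whole response has downloaded. The function names and the 200 ms prebuffer are made up for illustration, and it is written in Python rather than Dart just to keep it short.

```python
from typing import Iterable, Iterator

# Assumed format of the streamed pcm audio (verify against the API docs).
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHANNELS = 1          # mono


def prebuffer_bytes(prebuffer_ms: int) -> int:
    """Number of raw PCM bytes that hold `prebuffer_ms` of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * prebuffer_ms // 1000


def play_when_ready(chunks: Iterable[bytes], prebuffer_ms: int = 200) -> Iterator[bytes]:
    """Hold back audio until the prebuffer fills, then pass chunks straight through.

    Each yielded value is what a (hypothetical) audio player would be fed.
    """
    needed = prebuffer_bytes(prebuffer_ms)
    pending = bytearray()
    started = False
    for chunk in chunks:
        if started:
            yield chunk  # playback already running: forward immediately
            continue
        pending.extend(chunk)
        if len(pending) >= needed:
            started = True
            yield bytes(pending)  # first audio handed to the player
            pending.clear()
    if not started and pending:
        yield bytes(pending)  # response shorter than the prebuffer: flush it
```

Under these assumptions 200 ms of audio is only 9,600 bytes, so if the first sound takes several seconds, the wait is probably happening before the first chunk arrives or inside the player's own buffering, not in the size of the audio itself.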
I had the same challenge and switched to the Realtime API with WebSockets... latency gone! Sorry, maybe not what you want, but if you make the leap, you won't regret it.
So TTS and STT will both be handled by OpenAI? Cost is the tradeoff, right?
Exactly. You can make three calls (STT → Chat Completion → TTS), which is slow but less expensive, or use realtime to do it all at once and pay more. And don’t forget, if you do STT yourself you also have to do voice detection, turn detection and interruption handling. I put a lot of work into all of this and had a solution that worked, but it was really unacceptable for users because of the delays and the “rules” you had to follow. Once I went to audio realtime, all of that went away. One suggestion would be to at least pilot using gpt-realtime so you can focus on making it a great AI rather than being stuck dealing with audio synchronization troubles. Then later on, if token costs are your limiting factor, invest in more engineering. But even then you just can’t get to the quality of interaction that happens when the model itself understands audio tokens natively.
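
To make the "three calls" point concrete, here is a minimal sketch of why the chained approach feels slow: each stage only starts after the previous one has finished, so the delays add up. The stage functions are stand-ins, not real STT, chat, or TTS calls, and `run_sequential` is a hypothetical helper name. Python is used only for brevity.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_sequential(
    stages: List[Stage],
    first_input: Any,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Any, Dict[str, float]]:
    """Run stages one after another and record how long each one took.

    Each stage receives the previous stage's output, mirroring the
    STT -> Chat Completion -> TTS chain. Returns (final_output, timings).
    """
    timings: Dict[str, float] = {}
    value = first_input
    for name, fn in stages:
        start = clock()
        value = fn(value)  # blocks until this stage is completely done
        timings[name] = clock() - start
    return value, timings


if __name__ == "__main__":
    # Placeholder stages; swap in the real calls to measure an actual app.
    stages = [
        ("stt", lambda audio: "hello"),
        ("chat", lambda text: text + " there"),
        ("tts", lambda text: text.encode("utf-8")),
    ]
    _, timings = run_sequential(stages, b"raw-audio")
    print(timings, "total:", sum(timings.values()))
```

Timing each stage like this on both Android and iOS should show which step accounts for the extra seconds before deciding whether the move to realtime is worth the cost.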