GPT Assistant with Whisper Integration - Slow and Unreliable Voice-to-Voice Performance

Hi everyone,

I hope you’re all doing well!

Over the past month, I’ve been building a Flutter app that integrates Whisper and a GPT assistant. The idea is for users to speak to the app, have their voice transcribed by Whisper, and receive spoken responses from GPT. However, I’m encountering significant issues with performance:

  1. Delay: There’s a 4–5 second lag between sending the transcription to GPT and receiving a response. While text input to GPT responds almost instantly, voice-to-voice interactions are frustratingly slow.

  2. Reliability: The feature works only about 20% of the time. Even for short audio recordings, the response fails to come through most of the time.

I’m wondering:

  • Is this a limitation of the GPT API itself?
  • Could I optimize my implementation to reduce latency?
  • Should I consider switching to another AI model?

The 4–5 second delay is a deal-breaker for my app’s user experience. Any advice or guidance would be greatly appreciated!

Thank you!
Eric

  1. You can run Whisper on your own server, so transcription doesn’t have to go through the hosted API.
  2. You can use pregenerated audio files containing (one of a larger set of) sentence openers, e.g. something like “hmmm, let me think about that. Yeah, I guess”
    Then ask the GPT model to create a response that continues after “hmmm, let me think about that. Yeah, I guess”… this might give the user at least the feeling of a better experience, since you can play the cached filler audio right away while the real response is still being generated.
  3. You might want to check whether you can use a server closer to OpenAI’s servers (somewhere in the western USA, most probably). Where are you located, by the way?
  4. Ahh, and of course… you can cache the audio… and maybe even use embeddings of the response as keys for the cached audio files. Over time you might have thousands of audio files, let’s say inside S3 storage, and they should be usable almost instantly, which by the way makes it a lot cheaper, I suppose…
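To make point 2 concrete, here’s a minimal sketch of how the filler-prefix trick could look server-side. The filler phrases, file names, and `build_request` helper are all my own assumptions, not anything from an SDK; in a real app you’d start playing the filler clip immediately and send `messages` to the chat completions API in parallel:

```python
import random

# Hypothetical set of pregenerated filler clips; each phrase has a
# matching pre-rendered audio file that can start playing immediately.
FILLERS = {
    "hmmm, let me think about that. Yeah, I guess": "filler_01.mp3",
    "that's a good question. Off the top of my head": "filler_02.mp3",
}

def build_request(transcript: str):
    """Pick a random filler and build a chat prompt whose answer continues it."""
    filler = random.choice(list(FILLERS))
    messages = [
        {"role": "system",
         "content": ("Continue the assistant's answer so that it reads "
                     f"naturally after the spoken opener: '{filler}'")},
        {"role": "user", "content": transcript},
    ]
    # Play FILLERS[filler] right away, then send `messages` to the model.
    return FILLERS[filler], messages
```

The key point is that the user hears audio within a few hundred milliseconds, which masks most of the 4–5 second model latency.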
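And for point 4, the embedding lookup could be as simple as a nearest-neighbour search over stored response embeddings. This is a toy sketch with made-up 3-dimensional vectors; in practice you’d get real embeddings from an embeddings API and the values would be S3 object keys:

```python
import math

# Hypothetical cache: embedding of a past response -> name of its cached audio file.
AUDIO_CACHE = {
    (1.0, 0.0, 0.0): "paris_answer.mp3",
    (0.0, 1.0, 0.0): "weather_answer.mp3",
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def lookup(embedding, threshold=0.9):
    """Return a cached audio file if a stored response is similar enough."""
    best_key = max(AUDIO_CACHE, key=lambda k: cosine(k, embedding))
    if cosine(best_key, embedding) >= threshold:
        return AUDIO_CACHE[best_key]  # cache hit: play instantly, skip TTS
    return None  # cache miss: call the model and synthesize fresh audio
```

A cache hit skips both the GPT call and the speech synthesis, so it’s both faster and cheaper, at the cost of possibly replaying a slightly stale answer.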