I have been building an app that relies on OpenAI Whisper for speech-to-text as well as text-to-speech. The problem is that the response time is too slow: it can take up to 12 seconds before it responds with a voice. How can the ChatGPT app respond so fast? It’s always under 2 seconds.
It’s using a newer model for sure, one we don’t yet have access to via API for this modality.
I don’t work at OpenAI but I can imagine what steps they took to speed up the process:
-Using TPUs instead of GPUs
-Starting the transcription before the user has finished talking
-Starting the text-to-speech as the response gets streamed instead of waiting
-Lots of optimisations on the backend (probably some dark magic happening since they know exactly how their own models work)
EDIT: Assuming OP was talking about the current voice feature and not the new one that isn’t out yet. That one’s speed can probably be attributed to the fact that it’s just one multimodal AI instead of three separate ones.
The techniques you can use are:
- Receive the chat completions response as a stream, and send it to text-to-speech one sentence at a time as the sentences arrive.
- Buffer the TTS audio and start playing immediately, on the assumption that audio is rendered faster than real time.
The amount of response text you accumulate before sending it can grow as the audio buffer gets further ahead of playback.
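As a rough illustration of the sentence-at-a-time idea, here is a minimal Python sketch. It is not how the ChatGPT app does it. `sentence_chunks`, `buffered_seconds`, `seconds_per_extra_sentence` and `max_sentences` are hypothetical names invented for this example, and the sentence splitter is deliberately naive (it will also split after abbreviations such as "e.g."). The `tokens` argument stands in for the text deltas from a streamed chat completion, and each yielded chunk is what you would send to your TTS service.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(
    tokens: Iterable[str],
    buffered_seconds: Callable[[], float] = lambda: 0.0,  # hypothetical: audio queued ahead of playback
    seconds_per_extra_sentence: float = 2.0,              # assumption: tune for your TTS
    max_sentences: int = 4,
) -> Iterator[str]:
    """Group streamed text into chunks of whole sentences for TTS.

    With an empty audio buffer a chunk is a single sentence, so playback can
    start early. The further ahead the buffer is, the more sentences go into
    each chunk.
    """
    pending = ""           # text after the last sentence boundary seen so far
    ready: List[str] = []  # complete sentences not yet emitted
    for token in tokens:
        pending += token
        parts = _SENTENCE_END.split(pending)
        ready.extend(p for p in parts[:-1] if p)  # all but the last part are complete
        pending = parts[-1]
        extra = int(buffered_seconds() // seconds_per_extra_sentence)
        target = min(max_sentences, 1 + extra)
        while len(ready) >= target:
            yield " ".join(ready[:target])
            del ready[:target]
    # Flush whatever is left when the stream ends.
    tail = ready + ([pending.strip()] if pending.strip() else [])
    if tail:
        yield " ".join(tail)


if __name__ == "__main__":
    stream = ["Hel", "lo there. ", "How are", " you? I am", " fine."]
    for chunk in sentence_chunks(stream):
        print(chunk)  # send each chunk to TTS here instead of printing
```

The first chunk is emitted as soon as the first sentence is complete, so the time to first audio depends on the first sentence rather than on the whole reply.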
The app is unlikely to be powered by gpt-4o voice yet.
Yes, I’m talking about the current one, not the recently demoed version. I did STT on the device, and the best I can get is around 5 seconds, at the cost of accuracy. My bot doesn’t talk that much, so I don’t think streaming would make it faster. I have a feeling there is some optimisation like you said. Maybe they start the model as soon as possible, and if the user keeps talking they just drop the response. But if I were to implement this it would be very expensive.
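For what it’s worth, the "start early and drop the response if the user keeps talking" idea could be sketched roughly like this with `asyncio`. This is only a guess at the pattern, not something OpenAI has confirmed. `SpeculativeResponder` and `generate_reply` are hypothetical names, and `generate_reply` is a stand-in for the real LLM + TTS call. The `started` counter shows where the extra cost comes from: every pause starts a generation, even if it is thrown away.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def generate_reply(transcript: str) -> str:
    """Hypothetical stand-in for the real (slow, billable) LLM + TTS call."""
    await asyncio.sleep(0.05)
    return f"reply to: {transcript}"


class SpeculativeResponder:
    """Start generating on every pause; discard the work if speech resumes."""

    def __init__(self, generate: Callable[[str], Awaitable[str]] = generate_reply):
        self._generate = generate
        self._task: Optional[asyncio.Task] = None
        self.started = 0  # number of generations started (each one costs money)

    def on_pause(self, transcript_so_far: str) -> None:
        """The user went quiet: speculatively start a reply."""
        self.cancel()
        self.started += 1
        self._task = asyncio.ensure_future(self._generate(transcript_so_far))

    def on_speech_resumed(self) -> None:
        """The user kept talking: the speculative reply is stale."""
        self.cancel()

    def cancel(self) -> None:
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def result(self) -> Optional[str]:
        """Wait for the reply to the latest pause, if any."""
        if self._task is None:
            return None
        return await self._task


async def demo():
    responder = SpeculativeResponder()
    responder.on_pause("turn on")             # first pause: start speculating
    await asyncio.sleep(0)                    # let the task begin
    responder.on_speech_resumed()             # user continues: drop that reply
    responder.on_pause("turn on the lights")  # real end of the utterance
    return await responder.result(), responder.started


if __name__ == "__main__":
    reply, started = asyncio.run(demo())
    print(reply, started)
```

Cancelling the local task only stops waiting on your side. Whether you are still billed for tokens already generated depends on the provider, so the cost worry is a fair one.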
My main reason not to use Whisper is that it isn’t made for real-time interaction (streaming). Right now (let’s see what happens in the near future with voice input in GPT-4o) you should check other STT services. For me the best one, taking into account reliability, languages, speed and cost, is Amazon Transcribe. I have already tried lots of them. The answer from GPT-4o itself is actually really fast. As for the last step, TTS generation, I also tried everything available on the market, and my current choice is Google Cloud TTS: it’s pretty fast, not expensive, and the quality is OK for my case. Right now I get a fairly natural conversational feel in terms of timing.