How can ChatGPT voice respond so fast?

valehelle · May 17, 2024, 7:29am

I have been building an app that relies on OpenAI Whisper as well as text-to-speech. The problem is that the response time is too slow: it can take up to 12 seconds before it responds with a voice. How can the ChatGPT app respond so fast? It’s always under 2 seconds.

merefield · May 17, 2024, 8:51am

It’s using a newer model for sure, one we don’t yet have access to via API for this modality.

turbolucius · May 17, 2024, 8:51am

I don’t work at OpenAI but I can imagine what steps they took to speed up the process:

- Using TPUs instead of GPUs
- Starting the transcription before the user has finished talking
- Starting the text-to-speech as the response gets streamed instead of waiting
- Lots of optimisations on the backend (probably some dark magic happening since they know exactly how their own models work)

EDIT: Assuming OP was talking about the current voice feature and not the new one that isn’t out yet. That one’s speed can probably be attributed to the fact that it’s just one multimodal AI instead of three separate ones.

_j · May 17, 2024, 9:53am

The techniques you can use are:

- Receive a streaming response from chat completions, and send it to text-to-speech a sentence at a time as the sentences are received.
- Buffer the TTS audio and start playing immediately, on the assumption that audio rendering is faster than realtime.

The amount of response text you accumulate before sending it to TTS can grow as the audio buffer gets further ahead of playback.
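A minimal sketch of the sentence-at-a-time idea, in Python. It only covers the text side: it takes the streamed text deltas (from whatever streaming client you use) and yields chunks that end on sentence boundaries, with the minimum chunk size growing after each chunk. The names `sentence_chunks`, `min_chars` and `growth` are made up for illustration, and the regex is a crude sentence splitter, not a proper one. Each yielded chunk would then go to your TTS call.

```python
import re
from typing import Iterable, Iterator

# Crude sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(
    deltas: Iterable[str], min_chars: int = 0, growth: int = 0
) -> Iterator[str]:
    """Yield sentence-aligned text chunks as streamed deltas arrive.

    A chunk is emitted at the first sentence boundary that lies at or
    beyond `threshold` characters. The threshold starts at `min_chars`
    and grows by `growth` after every chunk, so later chunks are larger
    once audio playback has a head start.
    """
    buffer = ""
    threshold = min_chars
    for delta in deltas:
        buffer += delta
        while True:
            # Find the first boundary at or past the current threshold.
            match = next(
                (m for m in _SENTENCE_END.finditer(buffer) if m.start() >= threshold),
                None,
            )
            if match is None:
                break  # no complete chunk yet, wait for more text
            yield buffer[: match.start()]
            buffer = buffer[match.end():]
            threshold += growth
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hel", "lo there. How ", "are you? I am", " fine. Thanks."]
    for chunk in sentence_chunks(stream, growth=20):
        print(repr(chunk))  # each chunk would be sent to TTS here
```

With `growth=0` every sentence is sent on its own; a positive `growth` trades a little latency on later chunks for fewer TTS requests.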

The app is unlikely to be powered by gpt-4o voice yet.

valehelle · May 17, 2024, 12:14pm

Yes, I’m talking about the current one, not the recently demoed version. I did STT on the device, and the best I can get is around 5 seconds, at the cost of accuracy. My bot doesn’t talk that much, so I don’t think streaming would make it faster. I have a feeling there is some optimisation like you said. Maybe they start the model as soon as possible, and if the user keeps talking they just drop the response. But if I were to implement this, it would be very expensive.
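To make the "start early and drop the response" idea concrete, here is a rough Python sketch using `asyncio`. This is only a guess at the pattern, not how the ChatGPT app is known to work. `generate_reply` is a hypothetical stand-in for the real model + TTS call (here it just sleeps), and the `(transcript, silence)` pairs simulate partial transcripts arriving while the user is still speaking.

```python
import asyncio


async def generate_reply(transcript: str, delay: float = 0.1) -> str:
    # Hypothetical stand-in for the real LLM + TTS request.
    await asyncio.sleep(delay)
    return f"reply to: {transcript}"


async def speculative_reply(partials: list[tuple[str, float]]) -> tuple[str, int]:
    """Start a reply for every partial transcript and drop superseded ones.

    `partials` holds (transcript so far, seconds until the next partial).
    Returns the reply for the last transcript and the number of requests
    that were started and then thrown away.
    """
    if not partials:
        raise ValueError("need at least one partial transcript")
    task = None
    dropped = 0
    for transcript, silence in partials:
        if task is not None:
            # The user kept talking: the earlier request is now useless.
            task.cancel()
            dropped += 1
        task = asyncio.create_task(generate_reply(transcript))
        await asyncio.sleep(silence)  # simulated wait for more speech
    return await task, dropped


if __name__ == "__main__":
    turns = [("what is", 0.01), ("what is the weather", 0.01),
             ("what is the weather today", 0.0)]
    reply, dropped = asyncio.run(speculative_reply(turns))
    print(reply, "| dropped requests:", dropped)
```

The `dropped` counter is the cost concern: every dropped request was still started, so with a paid API you would likely be billed for at least part of the work on each one.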

vr4content · May 17, 2024, 1:23pm

My main reason not to use Whisper is that it is not made for real-time interaction (streaming). Right now (let’s see what the near future brings with voice input in GPT-4o) you should check other STT services. For me the best one (considering reliability, languages, speed and cost) is Amazon Transcribe; I have already tried lots of them. Actually, the answer from GPT-4o is really fast. As for the last step, TTS generation, I also tried everything available on the market, and my current choice is Google Cloud TTS: it is pretty fast, not expensive, and the quality is OK for my case. Right now I achieve a fairly natural conversational feel in terms of timing.
