How can I make a voice conversation feel realistic, like talking to a human, with a latency of around 200 ms using the Whisper API?
Has anybody achieved good latency with GPT-4o?
That simply cannot be achieved.
Here is a test using WAV output, which doesn't need to wait on a codec. It shows when each HTTP chunk arrived, in seconds, with the timer started after sending the API request, for the phrase
input_text="hello there, I'm making a wav file today"
The fastest trial:
= RESTART: C:\chat\speech-stream2.py
buffer chunk added: 0.0
buffer chunk added: 0.4129817485809326
buffer chunk added: 0.4691469669342041
buffer chunk added: 0.47687220573425293
buffer chunk added: 0.4840683937072754
buffer chunk added: 0.4925053119659424
buffer chunk added: 0.4925053119659424
buffer chunk added: 0.5014071464538574
buffer chunk added: 0.5014071464538574
buffer chunk added: 0.5110487937927246
buffer chunk added: 0.5110487937927246
buffer chunk added: 0.5110487937927246
buffer chunk added: 0.5240192413330078
buffer chunk added: 0.5299415588378906
buffer chunk added: 0.5316817760467529
buffer chunk added: 0.5400404930114746
buffer chunk added: 0.54500412940979
buffer chunk added: 0.54500412940979
buffer chunk added: 0.54500412940979
buffer chunk added: 0.5560660362243652
buffer chunk added: 0.5560660362243652
buffer chunk added: 0.5692315101623535
buffer chunk added: 0.5742206573486328
buffer chunk added: 0.5742206573486328
buffer chunk added: 0.5742206573486328
buffer chunk added: 0.5879650115966797
buffer chunk added: 0.5942468643188477
buffer chunk added: 0.5942468643188477
buffer chunk added: 0.6040196418762207
buffer chunk added: 0.6089839935302734
buffer chunk added: 0.6089839935302734
buffer chunk added: 0.6200499534606934
buffer chunk added: 0.6200499534606934
buffer chunk added: 0.6200499534606934
buffer chunk added: 0.6355159282684326
buffer chunk added: 0.6440179347991943
buffer chunk added: 0.6514706611633301
buffer chunk added: 0.6671669483184814
buffer chunk added: 0.6671669483184814
buffer chunk added: 0.6853346824645996
buffer chunk added: 0.6945838928222656
buffer chunk added: 0.7049524784088135
buffer chunk added: 0.7147464752197266
buffer chunk added: 0.7201125621795654
buffer chunk added: 0.732421875
buffer chunk added: 0.732421875
buffer chunk added: 0.7463281154632568
buffer chunk added: 0.7621917724609375
buffer chunk added: 0.7621917724609375
buffer chunk added: 0.7773334980010986
buffer chunk added: 0.7773334980010986
buffer chunk added: 0.8016865253448486
buffer chunk added: 0.8090205192565918
buffer chunk added: 0.8205859661102295
buffer chunk added: 0.826723575592041
buffer chunk added: 0.8325541019439697
That first chunk can be played right away only because WAV carries actual samples rather than the larger frames of a codec. Even so, a buffer at least as deep as this sentence is needed.
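For anyone who wants to collect timings like the ones above, here is a minimal sketch. It is not the speech-stream2.py script from this post; `time_chunks` and `fake_stream` are hypothetical names, and the fake stream only stands in for whatever byte iterator your HTTP client gives you for a streamed response.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def time_chunks(chunks: Iterable[bytes]) -> List[Tuple[float, int]]:
    """Record (seconds since start, chunk size) for each chunk of a stream."""
    start = time.perf_counter()  # start the clock just before reading
    log: List[Tuple[float, int]] = []
    for chunk in chunks:
        log.append((time.perf_counter() - start, len(chunk)))
    return log


def fake_stream(n: int = 3, delay: float = 0.01, size: int = 1024) -> Iterator[bytes]:
    """Hypothetical stand-in for a network stream: n chunks, delay seconds apart."""
    for _ in range(n):
        time.sleep(delay)
        yield b"\x00" * size


if __name__ == "__main__":
    for elapsed, size in time_chunks(fake_stream()):
        print(f"buffer chunk added: {elapsed:.4f} ({size} bytes)")
```

To time a real request, start the clock right after sending it and pass the response's chunk iterator in place of `fake_stream()`.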
With WAV, I've also found that the network might not keep up with real time, or that the chunks following the quick start of the first sentence are not ready in time. You'd have to use a browser's codec stream buffer class with AAC, adding more prebuffering.
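To put a number on "more prebuffering", here is a rough sketch that takes chunk arrival times and sizes and works out the smallest delay before starting playback so that the player never runs dry. `min_start_delay` is a hypothetical helper, the sample numbers are made up, and the 48000 bytes per second figure assumes 24 kHz 16-bit mono audio, so substitute your own format's byte rate.

```python
from typing import Sequence


def min_start_delay(
    arrivals: Sequence[float],
    sizes: Sequence[int],
    bytes_per_second: float,
) -> float:
    """Smallest playback start time (seconds) that avoids a buffer underrun.

    arrivals[i] is when chunk i arrived, sizes[i] is its length in bytes.
    Chunk i is needed at start + (bytes before it) / bytes_per_second,
    so start must be at least arrivals[i] minus that offset for every chunk.
    """
    if not arrivals or len(arrivals) != len(sizes):
        raise ValueError("arrivals and sizes must be non-empty and equal length")
    delay = 0.0
    played_bytes = 0
    for arrived, size in zip(arrivals, sizes):
        playback_offset = played_bytes / bytes_per_second
        delay = max(delay, arrived - playback_offset)
        played_bytes += size
    return delay


if __name__ == "__main__":
    # Made-up example: the third chunk arrives late, forcing a deeper buffer.
    arrivals = [0.4, 0.5, 1.2]
    sizes = [4800, 4800, 4800]  # 0.1 s of audio each at the assumed rate
    print(f"start playback after {min_start_delay(arrivals, sizes, 48000):.2f} s")
```

This only works after the fact, on a recorded trial, so in practice you would run several trials and pick a prebuffer depth with some margin on top of the worst case.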