Whisper Streaming Strategy

You can do some kind of speculative pre-prediction.

I mean, in a lot of sentences you already know what the person you are talking with is… going… to… say… pretty… early… in the sentence.
Therefore you may want to hand the incomplete sentence to a model in the background, let it complete the sentence, and generate an answer based on that completion.
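A minimal sketch of that idea, assuming hypothetical `complete_sentence` and `generate_answer` stand-ins for your actual LLM calls: every time the partial transcript grows, a background speculation is kicked off; if the final transcript matches the speculated sentence, the answer is already (nearly) done.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real LLM calls.
def complete_sentence(partial: str) -> str:
    # e.g. "what is your refund" -> "what is your refund policy?"
    return partial + " policy?"

def generate_answer(question: str) -> str:
    return f"Our answer to: {question}"

class SpeculativeResponder:
    """Start answer generation on a *predicted* sentence while the
    speaker is still talking; reuse it if the prediction held."""

    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.speculation = None          # (predicted_sentence, future)

    def on_partial_transcript(self, partial: str) -> None:
        predicted = complete_sentence(partial)
        if self.speculation and self.speculation[0] == predicted:
            return                       # same prediction: keep the running job
        self.speculation = (predicted, self.pool.submit(generate_answer, predicted))

    def on_final_transcript(self, final: str) -> str:
        if self.speculation and self.speculation[0] == final:
            return self.speculation[1].result()   # hit: answer already computed
        return generate_answer(final)             # miss: fall back to normal path
```

In practice you would compare the prediction to the final transcript with fuzzy matching rather than exact equality, and cancel stale speculations.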

You may also cache the beginning of an answer as an mp3 file and instruct the assistant/TTS to continue the answer after that cached beginning - I don't know if that makes any sense to you.
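One way to express that "continue after the cached opening" instruction is purely in the prompt. The template below is an invented sketch, not any specific API:

```python
def continuation_prompt(question: str, cached_opening: str) -> str:
    """Build a prompt that makes the model continue an answer whose
    first words are already playing from a cached mp3 file."""
    return (
        "You are a voice assistant. The user asked:\n"
        f"{question}\n\n"
        "Your reply has already started with the spoken phrase "
        f'"{cached_opening}" - continue the answer seamlessly after '
        "that phrase. Do not repeat it."
    )
```

The cached opening plays immediately while the model generates only the continuation, which the TTS then picks up.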

Also, random scenes: e.g. something drops in the background and the bot says stuff like "ooopsie, sorry, I dropped a knife… not what you are thinking now, hahaha. A knife I use to eat at my desk… yeah, I know, sometimes I eat at my desk", which gives the backend enough time to generate.

The answer would then start with "Anyways, back to your question…", streamed as a cached mp3, and the assistant would be instructed to continue the answer after "Anyways, back to your question".
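The timing trick here is that playback and generation overlap. A rough sketch, with hypothetical `play_mp3` and `generate` callables standing in for your audio player and LLM:

```python
import threading

def answer_with_time_buyers(question, play_mp3, generate):
    """While cached filler audio buys time, generate the real answer in
    the background; by the time the cached segue has finished playing,
    the continuation should be ready to stream."""
    result = {}
    worker = threading.Thread(
        target=lambda: result.setdefault("text", generate(question)))
    worker.start()                       # generation starts immediately
    play_mp3("filler_scene.mp3")         # "ooopsie, I dropped a knife..."
    play_mp3("anyways_back.mp3")         # cached "Anyways, back to your question..."
    worker.join()                        # generation overlapped with playback
    return result["text"]
```

The filenames are placeholders; the point is just that the blocking playback calls hide the generation latency.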

Also, in support hotlines you will find that many questions are asked over and over again.

Caching the answers in general, and even doing some embeddings that map to either an mp3 file location or a function call that creates the answer, might work as well.
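A sketch of that embedding cache, assuming a pluggable `embed` function (your embedding model) and plain cosine similarity; hits return a stored mp3 path, misses fall through to a generator function:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class AnswerCache:
    """Map embedded questions to cached mp3 paths; fall back to a
    generator function when no cached answer is similar enough."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed               # your embedding model goes here
        self.threshold = threshold
        self.entries = []                # list of (vector, mp3_path)

    def add(self, question, mp3_path):
        self.entries.append((self.embed(question), mp3_path))

    def lookup(self, question, fallback):
        vec = self.embed(question)
        best = max(self.entries, key=lambda e: cosine(e[0], vec), default=None)
        if best and cosine(best[0], vec) >= self.threshold:
            return best[1]               # cache hit: stream the stored mp3
        return fallback(question)        # miss: generate the answer fresh
```

With a real embedding model, paraphrases of a frequent hotline question land near the cached entry and get the pre-rendered mp3 instantly.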