Whisper-Live ASR for live-streaming transcription has an 8-10 second delay. How can I reduce it?


I did a proof of concept (POC) with whisper-live using the medium model. I am seeing an 8-10 second delay in transcription, and the captions are printed as whole paragraphs. My machine is a Windows PC with 16 GB RAM and a 12th Gen Intel(R) Core processor. My requirement is accurate transcription with a latency of at most 4 seconds, with the captions printed word by word. Is this possible with a high-performance VM configuration and the Whisper large-v2 model?
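To make the requirement concrete, here is a minimal sketch of what I mean by word-by-word output within a 4-second budget. It is only an illustration: the `Word` structure, the `word_delays` helper, and the sample timings are hypothetical and not part of whisper-live's API. It assumes each word's end time in the audio and the time its caption was produced are both available.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple

MAX_DELAY_S = 4.0  # latency budget from the requirement above


@dataclass
class Word:
    """Hypothetical per-word record; not a whisper-live type."""
    text: str
    spoken_at: float   # when the word ended in the audio stream (seconds)
    emitted_at: float  # when its caption was produced (seconds)


def word_delays(words: Iterable[Word]) -> Iterator[Tuple[str, float, bool]]:
    """Yield (word, delay, within_budget) one word at a time."""
    for w in words:
        delay = w.emitted_at - w.spoken_at
        yield w.text, delay, delay <= MAX_DELAY_S


if __name__ == "__main__":
    # Made-up timings for illustration only.
    sample = [Word("hello", 1.0, 3.5), Word("world", 1.6, 10.2)]
    for text, delay, ok in word_delays(sample):
        print(f"{text}: {delay:.1f}s {'OK' if ok else 'TOO SLOW'}")
```

In other words, each word should be emitted individually, and its delay relative to when it was spoken should stay at or below 4 seconds.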

I am looking for any proven record of Whisper being used for live streaming with high accuracy and a maximum transcription delay of 4 seconds. Specifically: what VM configuration would optimize performance, what transcription delay (in seconds) can be expected, and is word-by-word transcription attainable with the Whisper large-v2 model?