I am working on building a transcription script that takes in live audio from my microphone and transcribes it into text. I am experimenting with server_vad and was wondering how I could get it to transcribe while I speak rather than only after I pause. I tried shortening some of the durations, but that seemed to hurt accuracy quite a bit. Would using Deepgram be better?
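Roughly the kind of turn-detection config I am adjusting (a minimal sketch assuming the OpenAI Realtime API's server_vad settings; the parameter names threshold, prefix_padding_ms and silence_duration_ms are from that API's docs as I understand them, so please correct me if they are off):

```python
import json

# Hedged sketch: server_vad tuning for the OpenAI Realtime API.
# Verify field names against the current API reference before relying on them.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # speech-probability threshold for the VAD
            "prefix_padding_ms": 300,    # audio kept from just before detected speech
            "silence_duration_ms": 500,  # pause length that ends a turn; lowering this
                                         # returns transcripts sooner but tends to clip
                                         # words mid-utterance (the accuracy drop I saw)
        }
    },
}
# ws.send(json.dumps(session_update))  # sent over the Realtime websocket connection
```

From what I can tell, shortening silence_duration_ms only makes the turn commit sooner; genuinely word-by-word output seems to need a provider that streams interim/partial results (Deepgram's live API, for example, sends partials marked is_final: false while you are still talking), which is why I am asking about it.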
Maybe you can put something between the audio stream and the model?
I am working on logic that predicts "what the person is most likely going to say" - so while you are still speaking it can start constructing an answer, and only if the score for an expected answer is high do I start streaming it. That is obviously more expensive, but caching helps there too.
That might be a little different, and it takes a lot of effort, but it also allows for special things like analysing for prompt injection (in that case the audio stream can even do a barge-in on the caller - lol - like streaming a "booooring" file or a "stop that, that makes no sense"), or playing with a smalltalk score when the bot has predefined goals to fulfill - e.g. allow a little smalltalk, let the model create ONE poem about strawberries, but fight abuse of a hotline - etc.
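Very roughly, that predict-and-score idea as a toy Python sketch. Everything here is made up for illustration (EXPECTED, predict_and_score, on_partial_transcript are placeholders); a real scorer would be a cheap LLM or classifier call, not string matching:

```python
"""Toy sketch: while partial transcripts stream in, predict the rest of the
utterance, score the prediction, and only start building/streaming an answer
once the score is high enough. Nothing here is a real API."""
from functools import lru_cache

CONFIDENCE_THRESHOLD = 0.85  # arbitrary example cut-off

# Toy "expected utterances" a hotline bot might prepare answers for.
EXPECTED = {
    "i forgot my password": "No problem, I can send you a reset link.",
    "write me a poem about strawberries": "Here is one short strawberry poem...",
}

@lru_cache(maxsize=256)  # "caching is a thing too"
def predict_and_score(partial: str) -> tuple[str, float]:
    """Toy scorer: fraction of an expected utterance already spoken.
    In a real system this would be a cheap LLM / classifier call."""
    partial = partial.lower().strip()
    best, best_score = "", 0.0
    for utterance, answer in EXPECTED.items():
        if partial and utterance.startswith(partial):
            score = len(partial) / len(utterance)
            if score > best_score:
                best, best_score = answer, score
    return best, best_score

def on_partial_transcript(partial: str) -> None:
    answer, score = predict_and_score(partial)
    if score >= CONFIDENCE_THRESHOLD:
        # Confident enough: start streaming the prepared answer
        # before the caller has finished speaking.
        print(f"[speculative answer, score={score:.2f}] {answer}")
    else:
        # Keep listening; a prompt-injection check or a barge-in
        # ("stop that, that makes no sense") could hook in here too.
        print(f"[still listening, score={score:.2f}]")

if __name__ == "__main__":
    for chunk in ["i forgot", "i forgot my pass", "i forgot my password"]:
        on_partial_transcript(chunk)
```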
There is quite a lot of stuff you can do besides "just" using a model wrapper.
I am also exploring whether there is another approach to AI, one where I use resonance instead of similarity. I think that is how the human brain does prediction: if an incoming signal has the same phase, amplitude and rhythm as something the system already knows, there is no need to create a prediction - we most likely already have one.
Kind of like the job of the amygdala.
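To make the resonance-vs-similarity point a bit more concrete, here is a toy numpy sketch (purely my own illustration, nothing I have actually built): a signal with the same rhythm and amplitude as a stored template still scores high even when you catch it at a different point in its cycle, while plain cosine similarity on the raw samples drops to zero.

```python
import numpy as np

def resonance_score(incoming: np.ndarray, template: np.ndarray) -> float:
    """Magnitude of the normalized cross-spectrum: close to 1 when the signals
    share the same dominant frequency and relative amplitude. For a single
    rhythm this stays high even if the phase is shifted."""
    inc_f = np.fft.rfft(incoming)
    tpl_f = np.fft.rfft(template)
    cross = np.abs(np.sum(inc_f * np.conj(tpl_f)))
    norm = np.linalg.norm(inc_f) * np.linalg.norm(tpl_f)
    return float(cross / norm) if norm else 0.0

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Plain sample-by-sample similarity of the raw signals."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    t = np.linspace(0, 1, 1000, endpoint=False)
    known = np.sin(2 * np.pi * 5 * t)                # stored "known" rhythm, 5 Hz
    in_phase = np.sin(2 * np.pi * 5 * t)             # same phase, amplitude, rhythm
    shifted = np.sin(2 * np.pi * 5 * t + np.pi / 2)  # same rhythm, phase-shifted
    print("resonance, in phase:", round(resonance_score(in_phase, known), 3))
    print("resonance, shifted :", round(resonance_score(shifted, known), 3))
    print("cosine,    in phase:", round(cosine_similarity(in_phase, known), 3))
    print("cosine,    shifted :", round(cosine_similarity(shifted, known), 3))
```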