Hi! I noticed that in its output, Whisper gives you tokens as well as an `avg_logprob` for that sequence of tokens.
I’m currently struggling to get some code working that extracts per-token logprobs, and I’d also be curious whether per-token timestamps are possible.
I think this might be doable, but I don’t want to do it in a hacky way that could be incorrect. Would love any help whatsoever!
I think a potential use case for this is speech intelligibility prediction, à la: [2204.04288] Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction
It would also be useful for subtitle and karaoke generation.
Words need to leave the screen after they are spoken, but currently Whisper’s timestamps hard-bump into each other with no pauses in between, which is super awkward for both karaoke and subtitling.
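The hard-bumping can be worked around in post-processing. Recent versions of the `openai-whisper` package accept `word_timestamps=True` in `transcribe()`, which yields per-word `start`/`end` times; given timings in that shape, a sketch like the one below (the `MIN_GAP` value and helper name are my own assumptions, not part of Whisper) trims each word’s display end so it leaves the screen before the next word appears:

```python
MIN_GAP = 0.05  # assumed: seconds of blank screen to force between words

def add_display_gaps(words, min_gap=MIN_GAP):
    """Trim word display ends so consecutive captions don't hard-bump."""
    adjusted = []
    for cur, nxt in zip(words, words[1:] + [None]):
        end = cur["end"]
        if nxt is not None and nxt["start"] - end < min_gap:
            # Pull the display end back so this word disappears
            # min_gap seconds before the next word starts.
            end = max(cur["start"], nxt["start"] - min_gap)
        adjusted.append({**cur, "end": end})
    return adjusted

# Toy timings in the shape word_timestamps=True produces.
words = [
    {"word": "hello", "start": 0.00, "end": 0.50},
    {"word": "world", "start": 0.50, "end": 1.10},  # hard-bumps into "hello"
    {"word": "again", "start": 2.00, "end": 2.40},  # genuine pause before this
]
out = add_display_gaps(words)
```

Here only "hello" gets trimmed; "world" already has a real pause after it, so its end time is left alone.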