How to extract per-token logprobs + timestamps from Whisper?

Hi! I noticed that Whisper’s output gives you the tokens along with an `avg_logprob` for that whole sequence of tokens.

I’m currently struggling to get code working that extracts per-token logprobs as well as per-token timestamps.

I’m curious whether this is even possible (I think it might be), but I also don’t want to do it in a hacky way that could be incorrect. Would love any help whatsoever!

Would also be curious if it’s possible to do per-token timestamps.
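Not a full answer, but a couple of pointers. With the Hugging Face port you can get per-step scores out of `generate(..., output_scores=True, return_dict_in_generate=True)` and turn them into per-token logprobs with `compute_transition_scores`; recent versions of openai-whisper also accept `word_timestamps=True` in `transcribe`, which returns word-level `start`/`end`/`probability`. Independent of which API you use, the underlying computation is just a log-softmax over the vocabulary followed by a gather at the chosen token ids; the reported `avg_logprob` is the mean of those values. A minimal NumPy sketch of that step (the function name is mine):

```python
import numpy as np

def per_token_logprobs(logits, token_ids):
    """Given decoder logits of shape (T, vocab_size) and the chosen
    token ids of shape (T,), return the log-probability assigned to
    each chosen token."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the logprob of the token that was actually emitted at each step.
    return log_probs[np.arange(len(token_ids)), token_ids]

# Toy example: two decoding steps over a 3-token vocabulary.
logits = np.log(np.array([[0.5, 0.25, 0.25],
                          [0.1, 0.8, 0.1]]))
token_logprobs = per_token_logprobs(logits, np.array([0, 1]))
avg_logprob = token_logprobs.mean()  # what Whisper averages per segment
```

This is only the shared math, not Whisper’s exact internals; for the real thing I’d pull the scores out of the decoder rather than re-running a softmax myself.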

I think a potential use case of this is measuring speech intelligibility prediction, a la: [2204.04288] Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction


It would also be useful for subtitle and karaoke generation.

Words need to leave the screen once they have been spoken.

Currently, Whisper’s segment timestamps butt right up against each other, with each segment ending exactly where the next begins, so pauses in the audio aren’t reflected.

It’s super awkward for both karaoke and subtitling.
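Assuming you can get word-level timestamps out of the model (e.g. via `word_timestamps=True` in recent openai-whisper), fixing the butting-up problem is mostly a formatting step: end each subtitle cue at its own last word’s end time instead of at the next cue’s start, so pauses become visible gaps. A rough sketch (function names and thresholds are mine, not from any Whisper API):

```python
def fmt(t):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int(round((s % 1) * 1000)):03d}"

def words_to_cues(words, max_gap=0.5, max_words=7):
    """Group (word, start, end) tuples into cues. Start a new cue when a
    pause longer than max_gap occurs or the current cue is full."""
    cues, current = [], []
    for word, start, end in words:
        if current and (start - current[-1][2] > max_gap or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append((word, start, end))
    if current:
        cues.append(current)
    return cues

def cues_to_srt(cues):
    """Render cues as SRT; each cue ends at its last word's end time,
    so silence between cues is left as a gap on screen."""
    blocks = []
    for i, cue in enumerate(cues, 1):
        text = " ".join(w for w, _, _ in cue)
        blocks.append(f"{i}\n{fmt(cue[0][1])} --> {fmt(cue[-1][2])}\n{text}\n")
    return "\n".join(blocks)

# Toy example: a pause between "world" and "pause" becomes a gap.
words = [("hello", 0.0, 0.4), ("world", 0.5, 0.9), ("pause", 2.0, 2.5)]
print(cues_to_srt(words_to_cues(words)))
```

The same grouping would work for karaoke highlighting, since each word keeps its own start/end.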