How to extract per-token logprobs + timestamps from Whisper?

Hi! I noticed that Whisper's output gives you the tokens for each segment as well as an ‘avg_logprob’ for that sequence of tokens.

I’m currently struggling to get code working that extracts per-token logprobs as well as per-token timestamps.

I’m curious if this is even possible (I think it might be) but I also don’t want to do it in a hacky way that might be incorrect. Would love any help whatsoever!

Would also be curious if it’s possible to do per-token timestamps.

I think a potential use case for this is speech intelligibility prediction, à la: [2204.04288] Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction
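
On the logprob side, here is a minimal sketch of the arithmetic only, not of Whisper's actual decoding code. It assumes you have somehow captured the decoder's logits at each step along with the token chosen at that step; how to hook those out of Whisper is not shown. The names `token_logprobs`, `step_logits`, and `token_ids` are made up for illustration, and the numbers are toy values.

```python
import math

def token_logprobs(step_logits, token_ids):
    """Return log p(chosen token) for each decoding step.

    step_logits: one list of raw logits per step (one float per vocab entry).
    token_ids:   index of the token that was actually emitted at each step.
    Both inputs are assumed to come from your own decoding hook.
    """
    out = []
    for logits, tok in zip(step_logits, token_ids):
        # Numerically stable log-softmax: subtract the max before exponentiating.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        out.append(logits[tok] - log_z)
    return out

# Toy example: 3-entry vocabulary, two decoding steps (made-up numbers).
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
chosen = [0, 2]
per_token = token_logprobs(logits, chosen)

# Averaging the per-token values gives one sequence-level score.
mean_logprob = sum(per_token) / len(per_token)
```

Each entry of `per_token` is a log-probability (so at most 0), which is the kind of per-token uncertainty signal the paper above works with. Whether the mean matches the reported average exactly depends on how Whisper normalizes it, which I have not verified here.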

It would also be useful for subtitle and karaoke generation.

Words that are spoken need to leave the screen after they are spoken.

Currently, Whisper has timestamps hard-bumping into each other with no pauses in between.

It’s super awkward for both karaoke and subtitling.
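
As a display-side workaround, here is a rough sketch that trims each cue so it leaves the screen before the next one starts. It assumes you already have per-word (or per-segment) dicts with `start` and `end` in seconds; as far as I know, recent openai-whisper releases can produce word-level entries of that shape via `transcribe(..., word_timestamps=True)`. The function `add_gaps` and its `max_hold` / `min_gap` parameters are hypothetical, and this does not recover the real pauses in the audio, it only shortens how long text stays up.

```python
def add_gaps(words, max_hold=0.5, min_gap=0.05):
    """Shorten each cue so consecutive cues no longer touch.

    words:    list of dicts with "word", "start", "end" (seconds), sorted by start.
    max_hold: longest time (seconds) a single cue may stay on screen.
    min_gap:  minimum blank time (seconds) before the next cue starts.
    """
    cues = []
    for i, w in enumerate(words):
        # Cap the on-screen duration.
        end = min(w["end"], w["start"] + max_hold)
        # Leave a small gap before the next cue, if there is one.
        if i + 1 < len(words):
            end = min(end, words[i + 1]["start"] - min_gap)
        # Never let a cue end before it starts.
        end = max(end, w["start"])
        cues.append({"word": w["word"], "start": w["start"], "end": end})
    return cues

# Made-up timestamps where the first cue's end bumps into the second's start.
words = [
    {"word": "hello", "start": 0.0, "end": 1.2},
    {"word": "world", "start": 1.2, "end": 1.6},
]
cues = add_gaps(words)
```

The defaults here are arbitrary; for karaoke you would probably want a longer `max_hold` than for word-by-word subtitles.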