Can whisper be prompted with a previous transcript?

I know whisper accepts prompts but I haven’t been able to successfully use them to my advantage.

I’m using whisper at the command line to create srt/lrc (same thing) files for music. These are a bit more than pure transcription because they return timing values. I don’t really know how that part works under the hood; it’s all part of the whisper.exe and whisper-faster.exe distributions that I use.

Often times a TXT file of the lyrics to a song already exists.

When whisper makes the srt/lrc, it makes mistakes. Of course it’s not 100%.

Is there a way to prompt it with the TXT transcript so that it is less likely to make these mistakes?

I must be doing it wrong, because my prompt actually causes MORE mistakes.

TL;DR: I’m already using whisper to make SRT subtitle/karaoke files for music, but it makes mistakes, and I want to prompt it with pre-existing copies of the song’s lyrics so that it makes fewer mistakes.

The far simpler method, at least as far as computation is concerned, would be to compare the text returned by Whisper against the available .txt file of “official” lyrics, treating the official file as the source of truth: any time a difference is detected, the word is replaced. Or, conversely, you could parse the timecodes from the Whisper-generated file, match them up against the lyric text file, and insert the timestamps.
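A rough sketch of that second variant, in case it helps. It assumes a plain SRT layout (index line, timing line, text) and uses Python’s standard difflib for fuzzy line matching. The function names and the 0.6 threshold are made up for illustration, not anything Whisper provides.

```python
import difflib
import re


def parse_srt(srt_text):
    """Return a list of (start, end, text) tuples from SRT-formatted text."""
    cues = []
    # Cues are separated by blank lines: index, timing line, then text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that does not look like a cue
        start, end = (part.strip() for part in lines[1].split("-->"))
        cues.append((start, end, " ".join(lines[2:])))
    return cues


def normalize(line):
    """Lowercase and drop punctuation so only the words are compared."""
    return re.sub(r"[^a-z0-9' ]+", "", line.lower()).strip()


def correct_cues(cues, lyric_lines, threshold=0.6):
    """Swap each cue's text for the most similar lyric line.

    Whisper's timing is kept. If no lyric line is similar enough
    (threshold is an arbitrary guess), the Whisper text is left alone.
    """
    candidates = [line.strip() for line in lyric_lines if line.strip()]
    corrected = []
    for start, end, text in cues:
        best, best_score = text, 0.0
        for cand in candidates:
            score = difflib.SequenceMatcher(
                None, normalize(text), normalize(cand)
            ).ratio()
            if score > best_score:
                best, best_score = cand, score
        corrected.append((start, end, best if best_score >= threshold else text))
    return corrected
```

Because each cue is matched against every lyric line independently, a chorus that appears only once in the text file can still correct every repeat in the audio, and junk lines in the text file are simply never picked.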

The official lyrics are rarely a one-to-one match and often have spammy stuff inserted into them, missing lines, lines not repeated, credits, etc. Dunno if you’ve ever seen what the output of 1000s of automatic lyric downloaders looks like, but it’s really not clean or consistent enough for a trivial algorithm.

IMO it needs an intelligence, artificial or otherwise, to sift through that before the job becomes trivial enough to be easily accomplished.

Perhaps there are some fun pre-existing packages out there, but it would be a hell of an algorithm.

So anyway, this is the type of use case I thought the prompting of whisper could be used for.

Well, you could feed the whisper lyrics and the official lyrics into GPT 3.5 or 4 and ask it to do its best to reconcile the two, maybe helping it out with a diff of the two files as well. I’d start with just the two files and ask it to build a cohesive whole from them.
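For the diff part, something like this would do as a starting point. It only uses Python’s standard difflib; the function name is made up, and how you word the prompt around the result is left open.

```python
import difflib


def lyric_diff(whisper_text, official_text):
    """Return a unified diff of the two lyric texts, compared line by line."""
    diff = difflib.unified_diff(
        whisper_text.splitlines(),
        official_text.splitlines(),
        fromfile="whisper",
        tofile="official",
        lineterm="",
    )
    # An empty string means the two texts are identical line for line.
    return "\n".join(diff)
```

Lines starting with "-" come from the Whisper output and lines starting with "+" come from the official lyrics, so the model (or a human) can see at a glance where they disagree.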

Have you tried a txt file with the exact content as a reference?
I’m asking because in my experience data quality is of high importance and as you mentioned the quality is often low and may cause additional problems.
Trying your approach with a few 100% correct examples may give you a good feeling for whether this is the approach you actually want to use.

Super valid point by @vb there: get a baseline and see if it’s going to work at all. You may even be able to provide a few examples with 3.5-16K. It may cost a few cents per song at that point, but the alternative is human checking at multiple $.

This is going to be done on somewhere between 5,000 and 40,000 songs. The data is not going to be perfect. Not even 10% of the time.

And yea, gpt-api would be way too expensive for this I think. I’d need to run a local instance. Which wouldn’t be the end of the world. I plan on churning this for months once I get it right-ish.

One thing to try: NVIDIA made an RTX AI vocal noise isolation system that could single out voices from surrounding noise. It was so good it was effectively able to remove anything that was not a voice. If you ran the songs through that prior to whisper processing, you may get increased accuracy, but there is still going to be some word deformation from artistic interpretation of word segments. Misheard lyrics are a thing even for human brains.

Neglected to mention I’m already doing that.

--beam_size 5 also seems to help, though I don’t know why. Made the difference between 5% and 95% success rate on a Slayer song.

I’m using a standalone exe version of whisper that is 25X faster or so than doing it in python, and a standalone version of demucs to do the wav separation.

I’m just trying to oomph it to be a little more perfect than it already is. It may make more sense to do this in postprocessing later and just proceed with generating my 40,000 LRC files.

Yea, it seems very much a case of the blind leading the blind if both datasets are arbitrary and of unknown accuracy… it’s super hard to build any kind of self-consistent model for next word prediction with song lyrics, because a lot of the time they do not follow typical grammar or syntactic structures. I could see some kind of consensus system where multiple lyric files are compared to the whisper text and a majority voting system is used.
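As a very rough idea of what that voting could look like for a single line: each lyric file nominates its closest line, and a nominee only replaces the Whisper line if more than half of the files agree on it. The function names and the 0.6 similarity threshold are invented for the sketch, and the matching is plain difflib from Python’s standard library.

```python
import difflib
from collections import Counter


def best_match(line, candidates):
    """Return (candidate, score) for the candidate most similar to line."""
    scored = [
        (difflib.SequenceMatcher(None, line.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    if not scored:
        return None, 0.0
    score, cand = max(scored)
    return cand, score


def vote_line(whisper_line, sources, threshold=0.6):
    """Pick the line that a strict majority of lyric sources agree on.

    sources is a list of lyric files, each given as a list of lines.
    If there is no strict majority, the Whisper line is returned unchanged.
    """
    votes = Counter()
    originals = {}  # remembers the first-seen spelling for each voted line
    for lines in sources:
        cand, score = best_match(whisper_line, lines)
        if cand is not None and score >= threshold:
            key = cand.strip().lower()
            votes[key] += 1
            originals.setdefault(key, cand.strip())
    if votes:
        key, count = votes.most_common(1)[0]
        if count * 2 > len(sources):
            return originals[key]
    return whisper_line
```

With three files where two say "The night is long" and one says "the light is long", a Whisper line of "the knight is long" comes back as "The night is long". With no agreement, Whisper’s own text is kept.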

The text ones tend to be accurate, but noisy, and are often missing things like repeating choruses, or have credits or comments stuck at the end, or sometimes a line of text that is obviously a text representation of a banner ad that appeared in the middle of the lyrics. Stuff like that.

The words themselves are usually accurate, for what is there.

Accurate, but incomplete and noisy.

The AI ones are… not accurate. Whisper isn’t perfect, even with the vocal tracks split apart. Sections where Whisper gets most words wrong are unfortunately very prevalent.