I mean… maybe it could lead to insanity? Not sure. But LLMs can also get stuck in this kind of repetition loop, usually as a result of “greedy decoding”.
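To make the failure mode concrete, here is a toy Python sketch (not any real model) in which the argmax always points back at the last token, so greedy decoding loops forever:

```python
import numpy as np

def greedy_decode(next_token_logits, start_token, max_len=12):
    # Greedy decoding: always take the single highest-probability token.
    tokens = [start_token]
    for _ in range(max_len):
        tokens.append(int(np.argmax(next_token_logits(tokens))))
    return tokens

def toy_logits(tokens, vocab_size=5):
    # Toy stand-in for a model that slightly prefers whatever token came last.
    logits = np.zeros(vocab_size)
    logits[tokens[-1]] += 1.0
    return logits

print(greedy_decode(toy_logits, start_token=2))
# [2, 2, 2, ...]: once the argmax points back at itself, greedy decoding
# can never escape; sampling at a higher temperature at least has a chance.
```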
But again, that’s why Whisper uses beam search plus a temperature fallback: decoding starts at temperature 0 and the temperature is only raised when quality checks fail, which is slightly counter-intuitive, since it’s the added randomness that breaks the loop.
Keep in mind that Whisper uses a timestamp-based sliding context window as well.
Whisper relies on accurate prediction of the timestamp tokens to determine the amount to shift the model’s 30-second audio context window by, and inaccurate transcription in one window may negatively impact transcription in the subsequent windows.
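As a rough illustration, here is a minimal sketch of that sliding-window logic, assuming hypothetical helpers `decode_window` and `last_timestamp` in place of the real model calls:

```python
WINDOW_SECONDS = 30.0  # Whisper's fixed audio context length

def transcribe_long_form(audio_seconds, decode_window, last_timestamp):
    # decode_window(start, end) -> transcription of that audio slice.
    # last_timestamp(result) -> last predicted timestamp in seconds,
    # relative to the window start, or None if none was produced.
    seek, segments = 0.0, []
    while seek < audio_seconds:
        result = decode_window(seek, min(seek + WINDOW_SECONDS, audio_seconds))
        segments.append(result)
        shift = last_timestamp(result)
        # Advance by the last predicted timestamp when there is one,
        # otherwise by the full window. A bad timestamp here misaligns
        # every window that follows.
        seek += shift if shift and shift > 0 else WINDOW_SECONDS
    return segments
```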
From the Whisper paper:

“We have developed a set of heuristics that help avoid failure cases of long-form transcription, which is applied in the results reported in sections 3.8 and 3.9. First, we use beam search with 5 beams using the log probability as the score function, to reduce repetition looping which happens more frequently in greedy decoding. We start with temperature 0, i.e. always selecting the tokens with the highest probability, and increase the temperature by 0.2 up to 1.0 when either the average log probability over the generated tokens is lower than −1 or the generated text has a gzip compression rate higher than 2.4. Providing the transcribed text from the preceding window as previous-text conditioning when the applied temperature is below 0.5 further improves the performance.”
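For concreteness, here is a minimal Python sketch of that fallback loop. It is not Whisper’s actual implementation: `decode` is a hypothetical stand-in for a single decoding pass, and the real pipeline also switches decoding strategy as the temperature rises (beam search at 0, sampling above), which this sketch omits.

```python
import gzip

def compression_ratio(text: str) -> float:
    # Highly repetitive text compresses very well, so a high ratio is a
    # cheap signal that the decoder got stuck in a repetition loop.
    data = text.encode("utf-8")
    return len(data) / len(gzip.compress(data))

def decode_with_fallback(decode, previous_text=None):
    # decode(temperature=..., prompt=...) -> (text, avg_logprob) is a
    # hypothetical callable wrapping one decoding pass over a window.
    text = ""
    for temperature in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        # Per the excerpt, condition on the preceding window's text only
        # when the applied temperature is below 0.5.
        prompt = previous_text if temperature < 0.5 else None
        text, avg_logprob = decode(temperature=temperature, prompt=prompt)
        # Accept the result once both quality checks pass.
        if avg_logprob >= -1.0 and compression_ratio(text) <= 2.4:
            return text
    return text  # every attempt failed the checks; keep the last one
```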