Whisper Guide Documentation – Prompt limit inconsistency

Hello, I have an issue with the Speech to text guide, specifically the parts about the Whisper prompt. The first statement says the model only considers the final 224 tokens of the prompt. The second statement says Whisper only considers the first 244 tokens of the prompt.

So which one is correct?

In Prompting > scenario 2:

To preserve the context of a file that was split into segments, you can prompt the model with the transcript of the preceding segment. This will make the transcript more accurate, as the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier. For multilingual inputs, Whisper uses a custom tokenizer. For English-only inputs, it uses the standard GPT-2 tokenizer. Both tokenizers are accessible through the open-source Whisper Python package.

Later in Improving reliability > Using the prompt parameter:

Since it wasn’t trained using instruction-following techniques, Whisper operates more like a base GPT model. It’s important to keep in mind that Whisper only considers the first 244 tokens of the prompt.
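
For reference, the pattern the first quote describes (carrying the previous segment's transcript forward as the prompt) would look roughly like this. This is just a minimal sketch assuming the official openai Python SDK; the segment file names are made up.

```python
# Minimal sketch of the "previous segment as prompt" pattern, assuming the
# official openai Python SDK (v1+) and made-up segment files.
from openai import OpenAI

client = OpenAI()

previous_transcript = ""
for path in ["segment_01.mp3", "segment_02.mp3"]:  # hypothetical file names
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            # Carry context across the segment boundary; per the docs quote,
            # only ~224 tokens of this are actually considered.
            prompt=previous_transcript,
        )
    previous_transcript = result.text
    print(path, "->", result.text)
```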


The first one applies when you use the previous transcript as the prompt, which makes sense if your audio data is sequential.

The second one applies when your prompt is more like a rule for Whisper to follow. In the docs example, it was a list of words with their correct spellings.
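
For anyone landing here, that second use case looks roughly like this. A minimal sketch assuming the official openai Python SDK; the file name and product spellings are made-up examples.

```python
# Minimal sketch of the "prompt as spelling hints" use case, assuming the
# official openai Python SDK; the file name and product names are made up.
from openai import OpenAI

client = OpenAI()

# A short glossary of correctly spelled terms, not a previous transcript.
spelling_hints = "ZyntriQix, Digique Plus, CynapseFive, VortiQore V8"

with open("product_call.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt=spelling_hints,  # nudges Whisper toward these spellings
    )

print(result.text)
```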

I just read through the docs and wondered the exact same thing. We only have one prompt parameter, so does it cut off the end or the beginning if the prompt has too many tokens?

It makes quite a big difference. It would be great to get some clarity here.
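
Until the docs are fixed, one way to sidestep the ambiguity is to tokenize and trim the prompt yourself so it never exceeds the limit; then it doesn't matter which end would be dropped. Rough sketch assuming the open-source openai-whisper package mentioned in the first quote, using the 224-token figure from there.

```python
# Rough sketch: count the prompt's tokens with the open-source Whisper
# tokenizer and keep only the most recent 224 yourself, so nothing gets
# silently truncated server-side. Assumes `pip install openai-whisper`;
# the 224 figure comes from the first docs quote above.
from whisper.tokenizer import get_tokenizer

MAX_PROMPT_TOKENS = 224

# Multilingual tokenizer; pass multilingual=False for the English-only
# GPT-2 tokenizer the docs mention.
tokenizer = get_tokenizer(multilingual=True)

def trim_prompt(text: str) -> str:
    """Keep only the last MAX_PROMPT_TOKENS tokens of a prompt string."""
    tokens = tokenizer.encode(text)
    if len(tokens) <= MAX_PROMPT_TOKENS:
        return text
    return tokenizer.decode(tokens[-MAX_PROMPT_TOKENS:])

long_transcript = "previous segment text " * 500  # stand-in for a long transcript
trimmed = trim_prompt(long_transcript)
print(len(tokenizer.encode(trimmed)))  # <= 224
```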

The question is still unanswered. The doc says both:

  • «The model will only consider the final 224 tokens of the prompt and ignore anything earlier»
  • «It’s important to keep in mind that Whisper only considers the first 244 tokens of the prompt»

It cannot be both…

Here: https://platform.openai.com/docs/guides/speech-to-text/prompting