Hello, I have an issue with the Speech to text guide regarding the Whisper prompt. The first statement says the model only considers the final 224 tokens of the prompt. The second statement says Whisper only considers the first 244 tokens of the prompt.
So which one is correct?
In Prompting > scenario 2:
To preserve the context of a file that was split into segments, you can prompt the model with the transcript of the preceding segment. This will make the transcript more accurate, as the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier. For multilingual inputs, Whisper uses a custom tokenizer. For English only inputs, it uses the standard GPT-2 tokenizer which are both accessible through the open source Whisper Python package.
Later in Improving reliability > Using the prompt parameter:
Since it wasn’t trained using instruction-following techniques, Whisper operates more like a base GPT model. It’s important to keep in mind that Whisper only considers the first 244 tokens of the prompt.
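For context, here is the rough pattern I'm following for files split into segments, based on the first quote. This is only a sketch that assumes the openai Python SDK and tiktoken (for the GPT-2 tokenizer the guide mentions for English inputs); the last_224_tokens helper and the segments/*.mp3 paths are my own illustration, and the truncation assumes the "final 224 tokens" reading is the correct one:

```python
from pathlib import Path

import tiktoken
from openai import OpenAI

client = OpenAI()
gpt2 = tiktoken.get_encoding("gpt2")  # standard GPT-2 tokenizer (English-only case)

def last_224_tokens(text: str) -> str:
    """Keep only the final 224 tokens, assuming the 'final 224 tokens' statement is right."""
    tokens = gpt2.encode(text)
    return gpt2.decode(tokens[-224:])

previous_transcript = ""
for segment_path in sorted(Path("segments").glob("*.mp3")):  # my split audio segments
    with open(segment_path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio,
            # Pass the preceding segment's transcript to preserve context
            prompt=last_224_tokens(previous_transcript),
        )
    previous_transcript = result.text
```

If the "first 244 tokens" statement is the correct one, the helper would need to keep the beginning of the transcript instead of the end, which is why I'd like to know which statement is accurate.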