We recently discovered that when the special symbols <| and |> appear in a prompt, GPT cannot see the text between them. Is this a bug in the preprocessing procedure, or are these symbols intentionally designed this way?
For example, consider the following prompt:
Please repeat the following text: <|blocked|>
The answer from GPT-4 is:
Of course, I can repeat the text for you. However, you haven’t provided any specific text for me to repeat. Please provide the text you’d like me to repeat, and I’ll be happy to do so.
Regardless of what is placed between the symbols, GPT seems unable to see it (unless spaces or punctuation marks are added inside).
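If you want to try reproducing this yourself outside the ChatGPT web UI, a call like the following should be enough. This is only a minimal sketch: it assumes the current v1-style openai Python SDK and an API key in the environment, and I have not exhaustively compared API behavior with the web UI.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Same prompt as in the example above, sent through the Chat Completions API.
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Please repeat the following text: <|blocked|>"}],
)
print(resp.choices[0].message.content)
```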
I have used the tiktoken library to check the tokenizer's output and found no issues (a sketch of that check is below). We also tried multiple inputs and observed that the model completely ignores the content. So I suspect either that there is a problem with regular-expression cleaning during the preprocessing stage, or that this symbol pair is internally defined for a special purpose whose functionality is hidden from the outside.
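This is roughly the check I ran with tiktoken; the "gpt-4" / cl100k_base encoding is assumed.

```python
import tiktoken

# Encoding used by gpt-4 (cl100k_base).
enc = tiktoken.encoding_for_model("gpt-4")

text = "Please repeat the following text: <|blocked|>"
tokens = enc.encode(text)  # default settings only reject *registered* special tokens
print(tokens)
print([enc.decode([t]) for t in tokens])

# "<|blocked|>" is not a registered special token of cl100k_base, so it encodes to
# ordinary tokens and decodes back intact -- the tokenizer itself is not dropping it.
```

Since the tokenizer round-trips the text cleanly, whatever removes it must happen somewhere else in the pipeline.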
Update: if we refresh the page, the special symbols and their content also disappear from the user prompt itself.
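Purely to illustrate the kind of regex cleaning I suspect (this is hypothetical, not OpenAI's actual code), a rule like the following would produce exactly the behavior we observe, including the exception for spaces and punctuation:

```python
import re

# HYPOTHETICAL illustration only -- not OpenAI's actual preprocessing.
# A cleanup rule meant to strip internal <|...|> control tokens would also
# silently remove ordinary user text wrapped in the same delimiters.
SPECIAL_LIKE = re.compile(r"<\|\w*\|>")  # only bare word characters between the bars

prompt = "Please repeat the following text: <|blocked|>"
print(SPECIAL_LIKE.sub("", prompt))
# -> "Please repeat the following text: "

# Content with spaces or punctuation does not match, which would explain why
# adding them makes the text visible to the model again.
print(SPECIAL_LIKE.sub("", "Please repeat: <|not blocked!|>"))
# -> unchanged
```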