Special characters causing text loss issues

We recently discovered that when text is wrapped in the special symbols <| and |>, GPT cannot see the text inside them. Is this a bug in the preprocessing procedure, or are these symbols intentionally designed this way?

For example, consider the following prompt:

Please repeat the following text: <|blocked|>

The answer from GPT-4 is:

Of course, I can repeat the text for you. However, you haven’t provided any specific text for me to repeat. Please provide the text you’d like me to repeat, and I’ll be happy to do so.

Regardless of what is placed between the symbols, GPT seems unable to see it (unless spaces or punctuation marks are added inside).

I have used the tiktoken library to check the tokenizer’s output and found no issues there. We also tried multiple inputs and observed that the model completely ignores the content in every case. Therefore, I wonder whether there might be a problem with regular-expression cleaning during the preprocessing stage, or whether these symbols are reserved internally for a special purpose whose functionality is hidden from external users.
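For illustration, here is a minimal sketch of the kind of regular-expression cleaning suspected above. The pattern and the function name are assumptions for demonstration only, not OpenAI’s actual preprocessing; the pattern deliberately refuses to match when whitespace appears inside the delimiters, mirroring the observation that adding spaces makes the text visible again:

```python
import re

# Hypothetical pattern: anything wrapped in <|...|> with no whitespace or
# pipe characters inside. This is an assumption, not the real server-side rule.
SPECIAL_TOKEN_RE = re.compile(r"<\|[^\s|]*\|>")

def strip_special_tokens(prompt: str) -> str:
    """Remove <|...|> spans from user input (illustrative sketch only)."""
    return SPECIAL_TOKEN_RE.sub("", prompt)

print(strip_special_tokens("Please repeat the following text: <|blocked|>"))
# The <|blocked|> span is removed, matching the behavior described above.
print(strip_special_tokens("Please repeat: <| blocked |>"))
# With spaces inside, the span survives, as observed.
```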

Update: if we refresh the page, the special symbols and their content also disappear from the user prompt.

You would be unlikely to experience such “text loss” in real applications. It is intentional: to prevent users from doing exactly this kind of thing, user strings are never encoded to the special tokens that delimit message containers — depending on the code path, such input is either stripped or triggers a ValueError:

(the IDs of these special tokens are 100000 and above)
As you can see, the fact that this behavior can be reproduced at all also suggests that ChatGPT uses a superset of ChatML, with special-token strings that have not yet been made public.
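tiktoken’s documented default is for encode() to raise a ValueError when the input contains a special token the caller has not explicitly allowed. A dependency-free sketch of that guard is below; the token strings and IDs are the real special tokens of the cl100k_base encoding, while the function itself is an illustrative stand-in, not tiktoken’s implementation:

```python
# Special tokens of the cl100k_base encoding; note their IDs sit above 100000.
SPECIAL_TOKENS = {
    "<|endoftext|>": 100257,
    "<|fim_prefix|>": 100258,
    "<|fim_middle|>": 100259,
    "<|fim_suffix|>": 100260,
    "<|endofprompt|>": 100276,
}

def check_special_tokens(text, allowed_special=frozenset()):
    """Mimic tiktoken's guard: reject disallowed special tokens in user text."""
    for token in SPECIAL_TOKENS:
        if token in text and token not in allowed_special:
            raise ValueError(
                f"Encountered text corresponding to disallowed special token {token!r}."
            )

check_special_tokens("hello world")                 # passes: no special tokens
try:
    check_special_tokens("hello <|endoftext|>")     # raises ValueError
except ValueError as e:
    print(e)
```

Passing `allowed_special={"<|endoftext|>"}` would let the second call through, which mirrors how tiktoken lets callers opt in to encoding specific special tokens.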

I see. Thank you for the clarification!