We recently discovered that when the special symbols <| and |> appear in a prompt, GPT cannot see the text between them. Is this a bug in the preprocessing procedure, or are these symbols intentionally designed this way?
For example, consider the following prompt:
Please repeat the following text: <|blocked|>
The answer from GPT-4 is:
Of course, I can repeat the text for you. However, you haven’t provided any specific text for me to repeat. Please provide the text you’d like me to repeat, and I’ll be happy to do so.
Regardless of what is placed between the symbols, GPT seems unable to see it (unless spaces or punctuation marks are added inside).
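If you want to try reproducing this yourself outside the ChatGPT web UI, a call like the following should be enough. This is only a minimal sketch: it assumes the current v1-style openai Python SDK and an API key in the environment, and I have not exhaustively compared API behavior with the web UI.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Same prompt as in the example above, sent through the Chat Completions API.
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Please repeat the following text: <|blocked|>"}],
)
print(resp.choices[0].message.content)
```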
I have used the tiktoken library to check the tokenizer's output and found no issues (a sketch of that check is below). We also tried multiple inputs and observed that the model completely ignores the content. So I suspect either that there is a problem with regular-expression cleaning during the preprocessing stage, or that this symbol pair is internally defined for a special purpose whose functionality is hidden from the outside.
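This is roughly the check I ran with tiktoken; the "gpt-4" / cl100k_base encoding is assumed.

```python
import tiktoken

# Encoding used by gpt-4 (cl100k_base).
enc = tiktoken.encoding_for_model("gpt-4")

text = "Please repeat the following text: <|blocked|>"
tokens = enc.encode(text)  # default settings only reject *registered* special tokens
print(tokens)
print([enc.decode([t]) for t in tokens])

# "<|blocked|>" is not a registered special token of cl100k_base, so it encodes to
# ordinary tokens and decodes back intact -- the tokenizer itself is not dropping it.
```

Since the tokenizer round-trips the text cleanly, whatever removes it must happen somewhere else in the pipeline.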
Update: if we refresh the page, the special symbols and their content also disappear from the user prompt itself.
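Purely to illustrate the kind of regex cleaning I suspect (this is hypothetical, not OpenAI's actual code), a rule like the following would produce exactly the behavior we observe, including the exception for spaces and punctuation:

```python
import re

# HYPOTHETICAL illustration only -- not OpenAI's actual preprocessing.
# A cleanup rule meant to strip internal <|...|> control tokens would also
# silently remove ordinary user text wrapped in the same delimiters.
SPECIAL_LIKE = re.compile(r"<\|\w*\|>")  # only bare word characters between the bars

prompt = "Please repeat the following text: <|blocked|>"
print(SPECIAL_LIKE.sub("", prompt))
# -> "Please repeat the following text: "

# Content with spaces or punctuation does not match, which would explain why
# adding them makes the text visible to the model again.
print(SPECIAL_LIKE.sub("", "Please repeat: <|not blocked!|>"))
# -> unchanged
```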