Many tokens have a counterpart token made of the same text with a single preceding space.
A new line converts to a token of its own. New lines written as escaped characters (`\n`, `\r`) are not treated as new lines; the backslash and the letter each convert to their own token. Other ways of representing new lines can be checked as needed.
The way to check is to use the OpenAI tokenization page with the GPT-3 option.
Space example.
Text
hello goodbye hello
Tokens
`hello`, ` goodbye`, ` hello`
Token Ids
[31373, 24829, 23748]
Notice that the token for `hello` (token id 31373) is different from the token for ` hello` (token id 23748), which includes the preceding space.
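As a rough illustration of the space example, the sketch below uses a toy vocabulary holding only the three tokens shown here and a greedy longest-match lookup. `VOCAB` and `toy_encode` are hypothetical names, and the matching is only a stand-in for the byte-pair encoding GPT-3 actually uses; it is not the real tokenizer.

```python
# Toy vocabulary: only the three tokens from the space example.
# The ids are copied from that example; nothing else is known to this sketch.
VOCAB = {"hello": 31373, " hello": 23748, " goodbye": 24829}


def toy_encode(text):
    """Greedy longest-match over VOCAB (a simplification, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at the current position.
        match = max(
            (piece for piece in VOCAB if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no toy token matches at position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


print(toy_encode("hello goodbye hello"))
print(toy_encode("hello"), toy_encode(" hello"))
```

The same word yields a different id depending on whether a space precedes it, which is why the first and last ids in the example differ.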
New line examples
Text
Line 1
Line 2\n
Line 3\r
Line 4\r\n
Line 5
```
Line a
Line b
```
Tokens
Token Ids
[13949, 352, 198, 13949, 362, 59, 77, 198, 13949, 513, 59, 81, 198, 13949, 604, 59, 81, 59, 77, 198, 13949, 642, 198, 198, 15506, 63, 198, 13949, 257, 198, 13949, 275, 198, 15506, 63, 198]
Notice that `Line 1`, together with the hidden new line, converts to three tokens:

| Token text | Token Id |
| --- | --- |
| `Line` | 13949 |
| ` 1` (with its preceding space) | 352 |
| New line | 198 |
So while new lines do not show up in the textual representation of the tokens, tokens are still created for them.
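One way to make the hidden new line tokens visible is to print each token's text with `repr`. The sketch below does that for the token ids from the new line example. `TOKEN_TEXT` is a hand-built lookup inferred from that example, not the real GPT-3 vocabulary, and `show_tokens` and `decode` are hypothetical helper names.

```python
# Hand-built id -> text lookup covering only the ids in the new line example.
# The pairs are inferred from that example, not loaded from the real vocabulary.
TOKEN_TEXT = {
    13949: "Line",
    352: " 1", 362: " 2", 513: " 3", 604: " 4", 642: " 5",
    257: " a", 275: " b",
    59: "\\", 77: "n", 81: "r",   # pieces of the escaped \n and \r
    15506: "``", 63: "`",          # the code fence is split into two tokens
    198: "\n",                     # the actual new line
}

TOKEN_IDS = [
    13949, 352, 198, 13949, 362, 59, 77, 198, 13949, 513, 59, 81, 198,
    13949, 604, 59, 81, 59, 77, 198, 13949, 642, 198, 198, 15506, 63, 198,
    13949, 257, 198, 13949, 275, 198, 15506, 63, 198,
]


def show_tokens(token_ids):
    """Pair each id with repr() of its text so whitespace becomes visible."""
    return [(tid, repr(TOKEN_TEXT[tid])) for tid in token_ids]


def decode(token_ids):
    """Join the token texts back into the original string."""
    return "".join(TOKEN_TEXT[tid] for tid in token_ids)


# The first three tokens correspond to "Line 1" plus its hidden new line.
for tid, text in show_tokens(TOKEN_IDS[:3]):
    print(tid, text)

print(TOKEN_IDS.count(198), "new line tokens in total")
```

Every occurrence of id 198 in the list is a real new line, while the escaped `\n` and `\r` show up as a backslash token followed by a letter token.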