You still seem to be laboring under some notion of “paragraphs”.
The AI won’t produce any special paragraph markers. It is trained on linefeeds.
| Representation | ASCII Decimal | ASCII Hex | Escape Sequence |
|---|---|---|---|
| Linefeed (LF) | 10 | 0A | \n |
The AI can be coerced into producing \r\n, the “Windows” carriage return plus linefeed combo, instead of the UNIX linefeed of just a \n, but that distinction only matters for how files are saved.
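If you want “paragraphs” out of the model's text, you have to derive them from the linefeeds yourself. Here is a minimal sketch, assuming a paragraph means "text separated by at least one blank line"; the function name `split_paragraphs` is just an illustration, not anything the AI or an API provides.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs using nothing but line-ending characters."""
    # Normalize Windows (\r\n) and bare \r endings to the plain \n linefeed.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: a blank line (two linefeeds, optionally with whitespace
    # between them) is what separates one paragraph from the next.
    parts = re.split(r"\n\s*\n", normalized)
    # Drop empty pieces left over from leading or trailing blank lines.
    return [part.strip() for part in parts if part.strip()]


sample = "First.\r\n\r\nSecond, line one.\r\nSecond, line two.\n\n\nThird."
paragraphs = split_paragraphs(sample)
```

A single linefeed inside a paragraph is kept as-is; only runs of blank lines are treated as separators.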
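Here are some actual BPE tokens that might end a line of computer code. The first number in each entry is the token's number, which hints at how “popular” the token is in the training corpus (in BPE, a lower number generally means a more common sequence); the second is the token's length in characters.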
[70927, 4, '}()\n']
[67917, 5, '}()\n\n']
[85794, 6, '}());\n']
[5526, 2, '})']
[3603, 3, '})\n']
[9001, 4, '})\n\n']
[44161, 5, '})\n\n\n']
[36200, 4, '})\r\n']
[71742, 6, '})\r\n\r\n']
[93450, 4, '})"\n']
[79709, 4, '})",']
[32989, 3, '})(']
[82275, 5, '})();']
[53512, 6, '})();\n']
[95446, 7, '})();\n\n']
[66406, 3, '}))']
[45295, 4, '}))\n']
[94697, 5, '}))\n\n']
[34727, 5, '}));\n']
[45417, 6, '}));\n\n']