I am curious about when and why one would use the custom special tokens that libraries like tiktoken let you declare, as in the examples below. What are the use cases for custom special tokens? Are they related to the use of delimiters in prompts?
- Tiktoken Link For JS.
For example:
import { encoding_for_model } from "tiktoken";

// Extend existing encoding with custom special tokens
const enc = encoding_for_model("gpt2", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
});
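To make the question concrete, here is how I currently picture the difference between a special token and an ordinary prompt delimiter. This is a minimal sketch of my own understanding, not tiktoken's implementation: the names SPECIAL_TOKENS and splitOnSpecialTokens are hypothetical, and the only thing taken from the example above is the pair of token strings and IDs. The idea is that a special token is matched as a whole and mapped to one reserved ID, while everything else (including delimiters such as ### typed into a prompt) is ordinary text that goes through normal tokenization.

```javascript
// Hypothetical illustration only; this is NOT how tiktoken is implemented.
// The token strings and IDs are the ones from the example above.
const SPECIAL_TOKENS = new Map([
  ["<|im_start|>", 100264],
  ["<|im_end|>", 100265],
]);

// Escape regex metacharacters so each special token is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Split text into ordinary text segments and special tokens, in order.
// Special tokens become a single reserved ID; ordinary text would still
// need to go through the normal BPE encoding step afterwards.
function splitOnSpecialTokens(text, specialTokens) {
  const alternatives = [...specialTokens.keys()].map(escapeRegExp).join("|");
  // The capture group makes split() keep the matched special tokens.
  const pattern = new RegExp(`(${alternatives})`);
  return text
    .split(pattern)
    .filter((part) => part.length > 0)
    .map((part) =>
      specialTokens.has(part)
        ? { type: "special", id: specialTokens.get(part) }
        : { type: "text", value: part }
    );
}

const segments = splitOnSpecialTokens(
  "<|im_start|>user\nHello<|im_end|>",
  SPECIAL_TOKENS
);
console.log(segments);
```

If this picture is right, the use case would be marking structure (for example message boundaries in a chat format) with IDs that normal text can never produce, which is what distinguishes them from plain-text delimiters.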
I am also curious about the custom regex pattern passed as the last argument to the Tiktoken constructor, as in the example below.
import { readFileSync } from "fs";
import { Tiktoken } from "tiktoken";

const encoder = new Tiktoken(
  readFileSync("./ranks/gpt2.tiktoken").toString("utf-8"),
  { "<|endoftext|>": 50256, "<|im_start|>": 100264, "<|im_end|>": 100265 },
  "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"
);
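As far as I can tell, this pattern is used to split the input into chunks before the BPE merges are applied, so that merges never cross a chunk boundary. The sketch below applies the same pattern with plain JavaScript regex matching to show which chunks it produces. It does not use tiktoken at all, and the names GPT2_PATTERN and preTokenize are my own, so treat it as an approximation of the pre-tokenization step rather than the library's actual code path.

```javascript
// The same pattern as in the constructor call above, written as a regex
// literal. The "u" flag is required for \p{L} and \p{N}; the "g" flag lets
// matchAll walk the whole string.
const GPT2_PATTERN =
  /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu;

// Return the chunks the pattern carves out of the text, in order.
function preTokenize(text) {
  return Array.from(text.matchAll(GPT2_PATTERN), (m) => m[0]);
}

const pieces = preTokenize("Hello world, it's 2024!");
console.log(pieces);
// [ 'Hello', ' world', ',', ' it', "'s", ' 2024', '!' ]
```

Contractions, runs of letters, runs of digits, and runs of punctuation each become their own chunk, with a single leading space attached to the chunk that follows it.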
Thanks for all the help.