What are custom special tokens in tiktoken and similar tokenizer libraries? Use cases?

I am curious about the circumstances in which we might use the custom special tokens that can be declared in libraries like tiktoken, as in the examples below. What are the use cases for custom special tokens? Does this have any connection with the use of delimiters in prompts?

- Tiktoken link for JS
For example:

import { encoding_for_model } from "tiktoken";

// Extend an existing encoding with custom special tokens
const enc = encoding_for_model("gpt2", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
});
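Once extended like this, those strings only encode to their reserved ids when you explicitly allow them; a minimal usage sketch, assuming the WASM "tiktoken" package:

// By default, special tokens found in input text throw an error;
// passing "all" as allowed_special maps them to their reserved ids instead.
const ids = enc.encode("<|im_start|>hello<|im_end|>", "all");
// ids starts with 100264 and ends with 100265
enc.free(); // WASM encodings should be freed when no longer needed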

Additionally, I am curious about the custom regex pattern that can be passed in, as in the example below.

import { Tiktoken } from "tiktoken/lite"; // import path may vary by package variant
import { readFileSync } from "node:fs";

// Construct a tokenizer from raw BPE ranks, a special-token map, and a split regex
const encoder = new Tiktoken(
  readFileSync("./ranks/gpt2.tiktoken").toString("utf-8"),
  { "<|endoftext|>": 50256, "<|im_start|>": 100264, "<|im_end|>": 100265 },
  "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"
);
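From what I can tell, that third argument is GPT-2's pre-tokenization pattern: it splits raw text into word-like chunks before the BPE merges run. Running the same pattern directly as a JS RegExp on a test string of my own:

const pattern = /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu;
console.log("Hello world's fine!".match(pattern));
// -> [ "Hello", " world", "'s", " fine", "!" ]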

Thanks for all the help. :pray: :bowing_man:

The library shows its general-purpose origins.

Custom special tokens are not useful with OpenAI products: you can't add new string-to-token mappings at the API endpoints, and even the existing special tokens are not encoded for you, with the exceptions of “<|endoftext|>” on the completions endpoint and the tokens that act as stop sequences when a chat model emits them.

If you had your own AI model, you could use special token numbers (which a user can't pass or simulate) to do things like train the AI to emit function calls enclosed in special tokens the endpoint can recognize, or wrap knowledge retrieval in a specially recognized container.
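As a rough sketch of that idea (the framing-token names and ids below are hypothetical, and this assumes the WASM "tiktoken" package):

import { encoding_for_model } from "tiktoken";

// Hypothetical framing tokens reserved in a self-hosted model's vocabulary
const enc = encoding_for_model("gpt2", {
  "<|fn_start|>": 100270,
  "<|fn_end|>": 100271,
});

// No user-typed text can ever encode to 100270/100271, so the serving layer
// can trust whatever the model emits between these two ids.
const ids = enc.encode('<|fn_start|>get_weather({"city":"Oslo"})<|fn_end|>', "all");
enc.free();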

Measured per model and token string (tokens counted when sent as input / when emitted as output):

- babbage-002, <|endoftext|>: 1 input / 1 output
- babbage-002, <|im_start|>: 6 input / 1 output
- babbage-002, <|im_end|>: 6 input / 1 output
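You can reproduce the input-side counts locally; a sketch assuming babbage-002 uses the cl100k_base encoding:

import { get_encoding } from "tiktoken";

const enc = get_encoding("cl100k_base");
console.log(enc.encode("<|im_start|>").length);          // split as ordinary text (the 6 above)
console.log(enc.encode("<|endoftext|>", "all").length);  // 1: recognized as a special token
enc.free();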

And yes, OpenAI is completely capable of putting chat models on a completions endpoint and encoding the tokens.