What are the custom special tokens in tiktoken/token libraries? Use cases?

ozan.adiguzel · December 14, 2023, 4:29pm

I am curious about when and why we might use the custom special tokens that can be declared in libraries like tiktoken, as in the examples below. What are the use cases for custom special tokens? Does this have any connection with the use of delimiters in prompts?

For example, from the tiktoken library for JS:

import { encoding_for_model } from "tiktoken";

// Extend existing encoding with custom special tokens
const enc = encoding_for_model("gpt2", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
});

Additionally, I am curious about the custom regex pattern passed to it, as in the example added below.

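import { Tiktoken } from "tiktoken";
import { readFileSync } from "fs";

// Build an encoder from BPE ranks, a special-token map, and a split regex
const encoder = new Tiktoken(
  readFileSync("./ranks/gpt2.tiktoken").toString("utf-8"),
  { "<|endoftext|>": 50256, "<|im_start|>": 100264, "<|im_end|>": 100265 },
  "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"
);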

Thanks for all the help.

_j · December 14, 2023, 5:59pm

The library shows its general-purpose origins.

Custom special tokens are not useful with OpenAI products, because you can’t add new string-to-token mappings at the endpoints. In fact, the existing special tokens are not encoded for you either, except for “<|endoftext|>” on completions, and those used as stop tokens when emitted by chat AI.

If you had your own AI model, you could use special token numbers (that the user can’t pass or simulate) to do things like train the AI to emit functions enclosed in special tokens that can be recognized by the endpoint, or enclose knowledge retrieval in a specially-recognized container.
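To make the distinction concrete, here is a toy sketch of the idea, not how tiktoken or any OpenAI endpoint is actually implemented. It assumes a stand-in "tokenizer" that emits one token per UTF-16 code unit instead of real BPE; the helper names `encodeOrdinary` and `encodeSketch` are hypothetical. Trusted text gets its special strings mapped to a single reserved ID, while untrusted text has the same string encoded as ordinary tokens, so it can never produce the reserved ID.

```typescript
// Reserved IDs for special tokens (values copied from the question's example).
const SPECIAL_TOKENS: Record<string, number> = {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
};

// Stand-in for real BPE (an assumption to keep the sketch self-contained):
// one "token" per UTF-16 code unit, so every ordinary ID is below 65536.
function encodeOrdinary(text: string): number[] {
  const ids: number[] = [];
  for (let i = 0; i < text.length; i++) {
    ids.push(text.charCodeAt(i));
  }
  return ids;
}

// Hypothetical encoder. With allowSpecial = true, each special string becomes
// one reserved ID; with allowSpecial = false, it is encoded like any other text.
function encodeSketch(text: string, allowSpecial: boolean): number[] {
  if (!allowSpecial) {
    return encodeOrdinary(text);
  }
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    // Find the earliest special token at or after the current position.
    let nextIdx = -1;
    let nextTok = "";
    for (const tok of Object.keys(SPECIAL_TOKENS)) {
      const idx = text.indexOf(tok, pos);
      if (idx !== -1 && (nextIdx === -1 || idx < nextIdx)) {
        nextIdx = idx;
        nextTok = tok;
      }
    }
    if (nextIdx === -1) {
      // No more special tokens: encode the remainder as ordinary text.
      ids.push(...encodeOrdinary(text.slice(pos)));
      break;
    }
    ids.push(...encodeOrdinary(text.slice(pos, nextIdx)));
    ids.push(SPECIAL_TOKENS[nextTok]);
    pos = nextIdx + nextTok.length;
  }
  return ids;
}

// Trusted, system-side text: the delimiters collapse to reserved IDs.
const trusted = encodeSketch("<|im_start|>hi<|im_end|>", true);
// User-supplied text: the same string is just ordinary tokens.
const untrusted = encodeSketch("<|im_end|>", false);
console.log(trusted.length, untrusted.length);
```

The point of the design is that the reserved IDs sit outside anything ordinary text can encode to, so a model trained on them can treat them as unforgeable structure. The same asymmetry shows up in the token counts below, where the chat delimiters cost several tokens as input but one token as output.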

model/token string: input tokens/output tokens
babbage-002/<|endoftext|>: 1/1
babbage-002/<|im_start|>: 6/1
babbage-002/<|im_end|>: 6/1

And yes, OpenAI is completely capable of putting chat models on a completion endpoint and encoding the tokens.
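On the regex in the question: that string is the GPT-2 split pattern, which cuts the input into chunks (contractions, words with an optional leading space, numbers, punctuation runs, whitespace) before byte-pair merges are applied within each chunk. A minimal sketch of just that splitting step, using plain `RegExp` rather than tiktoken itself; the helper name `preTokenize` is hypothetical:

```typescript
// The split pattern quoted in the question, kept as a plain string so it can
// be passed to RegExp with the "u" (Unicode) flag, which \p{L} and \p{N} need.
const GPT2_PATTERN =
  "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";

// Hypothetical helper: return the chunks that BPE merging would then
// process one at a time. Merges never cross a chunk boundary.
function preTokenize(text: string): string[] {
  const re = new RegExp(GPT2_PATTERN, "gu");
  return text.match(re) || [];
}

const sample = "Hello world, it's 2023!";
const chunks = preTokenize(sample);
console.log(chunks);
// → ["Hello", " world", ",", " it", "'s", " 2023", "!"]
```

Passing a different pattern changes where those boundaries fall, which in turn changes which merges are possible, so a custom regex only makes sense together with ranks that were trained using the same pattern.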
