I think other people have more or less answered the question “How can I use existing tools to count tokens/tokenize?”, and great props to them for their practical contributions. I do, however, think the thread’s titular question hasn’t been directly answered. Here’s my understanding from reading over the tiktoken source code, though I admit it may be imperfect: I’m an amateur in Rust, which is the language that does the heavy lifting for tiktoken’s encoder.
Suppose you run the following code:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("hello world")
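For what it’s worth, once you have that list, the count everyone in this thread is after is just its length, and decode inverts the mapping. A quick continuation of the snippet above:

n_tokens = len(tokens)        # token count: 2 for "hello world" under cl100k_base
round_trip = enc.decode(tokens)
assert round_trip == "hello world"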
As for the titular question of what encode actually does: get_encoding looks up a constructor function by name. For this encoding, that constructor is cl100k_base, defined in the file tiktoken_ext/openai_public.py, and it looks like this:
def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
    )
    special_tokens = {
        ENDOFTEXT: 100257,
        FIM_PREFIX: 100258,
        FIM_MIDDLE: 100259,
        FIM_SUFFIX: 100260,
        ENDOFPROMPT: 100276,
    }
    return {
        "name": "cl100k_base",
        "pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }
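Before picking this dict apart, it may help to see how it gets consumed. As best I can tell from tiktoken/registry.py, get_encoding essentially just splats the dict into the Encoding constructor, so the following (slightly simplified) is morally equivalent to the get_encoding("cl100k_base") call above:

import tiktoken
from tiktoken_ext.openai_public import cl100k_base

params = cl100k_base()             # the constructor shown above
enc = tiktoken.Encoding(**params)  # name, pat_str, mergeable_ranks, and
                                   # special_tokens all go to the Rust core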
Notice the constructor returns pat_str, the regex that others here have mentioned, as well as two collections: special_tokens and mergeable_ranks. special_tokens is just a small mapping from sentinel strings to reserved token IDs; these flags get packaged with your query to help the language model understand where its job begins and ends, and they are handled separately from ordinary text. mergeable_ranks holds the contents of a table in tiktoken’s own format, loaded from the web at the URL above. The rows in said table look like this:
IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
...
IGRldGVybWluaXN0aWM= 73449
IFVOSVFVRQ== 73450
IGV0dMOk 73451
U2luZ2xlTm9kZQ== 73452
...
The left-hand column is just the base64 encoding of a sequence of bytes, and the right-hand column is that sequence’s rank. As you can probably tell, these ranks assign some notion of precedence for which byte sequences get merged into tokens first. The rankings were presumably learned so that typical text encodes into as few tokens as possible; the general technique (byte-pair-encoding training) is well known, though OpenAI hasn’t published the corpus or details behind this particular vocabulary. In any case, judging by the increasingly uncommon-looking sequences further down the table, lower-ranked sequences get merged first.
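You can verify the format yourself by decoding a few rows with ordinary base64 (the rows below are copied from the listing above):

import base64

for row in ["IQ== 0", "IGRldGVybWluaXN0aWM= 73449", "IFVOSVFVRQ== 73450"]:
    token_b64, rank = row.split()
    print(base64.b64decode(token_b64), int(rank))
# b'!' 0
# b' deterministic' 73449
# b' UNIQUE' 73450

So rank 0 is the single byte "!", and rank 73449 is the byte sequence " deterministic", leading space included.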
Getting back to the flow of our function call, get_encoding then pushes the cl100k_base constructor down a pipeline which extracts these components and hands them to the “actual algorithm,” which lives in a program compiled from src/lib.rs. Calling enc.encode("hello world") runs this algorithm. After the special tokens are filtered out, the control flow ends up in the following function:
fn _encode_ordinary_native(&self, text: &str) -> Vec<usize> {
    // This is the core of the encoding logic; the other functions in here
    // just make things complicated :-)
    let regex = self._get_tl_regex();
    let mut ret = vec![];
    for mat in regex.find_iter(text) {
        let piece = mat.unwrap().as_str().as_bytes();
        if let Some(token) = self.encoder.get(piece) {
            ret.push(*token);
            continue;
        }
        ret.extend(&byte_pair_encode(piece, &self.encoder));
    }
    ret
}
So this function first uses the regex to split the text into pieces, then loops through those pieces. For each piece, it checks whether the piece as a whole is a predefined token; if so, it appends that token to ret, the output token vector. If the piece is not a token, the function passes the piece to byte_pair_encode, which parses it into one or more tokens and appends them to the end of ret.
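If it helps, here is roughly the same logic sketched in Python. This is my paraphrase, not tiktoken’s code: bpe_merge is a hypothetical stand-in for byte_pair_encode (sketched further down), and pat_str and mergeable_ranks are the pieces returned by cl100k_base above. Note it needs the third-party regex module, since Python’s built-in re can’t handle the \p{L} classes in pat_str.

import regex  # pip install regex

def encode_ordinary(text: str, pat_str: str, mergeable_ranks: dict[bytes, int]) -> list[int]:
    ret = []
    for match in regex.finditer(pat_str, text):
        piece = match.group().encode("utf-8")
        token = mergeable_ranks.get(piece)
        if token is not None:  # the whole piece is already a known token
            ret.append(token)
        else:                  # otherwise fall back to byte-pair merging
            ret.extend(bpe_merge(piece, mergeable_ranks))
    return ret

The Rust byte_pair_encode it falls back to looks like this: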
pub fn byte_pair_encode(piece: &[u8], ranks: &HashMap<Vec<u8>, usize>) -> Vec<usize> {
    if piece.len() == 1 {
        return vec![ranks[piece]];
    }
    _byte_pair_merge(piece, ranks, |p| ranks[&piece[p.start..p.end]])
}
This function has two cases. If the piece has length one, it returns vec![ranks[piece]], indicating the piece comprises a single token whose numerical “token value” is simply its rank from the table. Otherwise, it returns the value of _byte_pair_merge(piece, ranks, |p| ranks[&piece[p.start..p.end]]), which parses the piece into multiple tokens. That function is pretty long and complicated, so I won’t paste it here, but as best as I can tell, it starts with parts, a vector of single-byte spans covering the piece, and then repeatedly merges the adjacent pair of spans whose combined bytes have the lowest rank in the table, stopping when no adjacent pair corresponds to a ranked sequence.
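Here is a simplified Python sketch of what I believe _byte_pair_merge boils down to; bpe_merge is my hypothetical name from the earlier sketch, and the real Rust code is far more careful about performance:

def bpe_merge(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    # Start with one part per byte.
    parts = [bytes([b]) for b in piece]
    while True:
        # Find the adjacent pair whose concatenation has the lowest rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:  # no mergeable pair left
            break
        # Merge the winning pair into a single part and keep going.
        parts[best_i : best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    # Every remaining part is a known byte sequence; map each to its rank,
    # which doubles as its token ID.
    return [ranks[p] for p in parts]

This terminates cleanly because every single byte has an entry in cl100k_base’s table, so no part can be left without a token ID.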
TL;DR:
- Split string with regex into pieces.
- Some pieces get mapped to tokens immediately.
- Other pieces are split into individual bytes.
- OpenAI hands us a table of rankings for byte sequences. The lower the rank, the more we like the sequence.
- The bytes in a piece are merged according to that table, always choosing the lowest-rank merge available.
- The merged byte sequences are finally mapped to tokens (a sequence’s rank doubles as its token ID).
- Now you have a vector of tokens. Count them if you want to.