What is the OpenAI algorithm to calculate tokens?

Say we put a sample /etc/hosts file into the tokenizer.

127.0.0.1	localhost
127.0.1.1	ha-laptop

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

It says this would parse to 75 tokens. The sample above has 189 characters, meaning if we take their estimate of token = char / 4 we would get 47 tokens. If we look at the words, we find it has 22 words, so 22 / 0.75 = 29 tokens.

Can anyone please help explain why this is?
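For reference, the arithmetic behind those two rough estimates (a small Python sketch using the counts quoted above):

chars, words = 189, 22   # character and word counts of the sample above
print(chars / 4)         # 47.25 -> the "1 token per 4 characters" rule of thumb
print(words / 0.75)      # 29.33 -> the "1 token per 0.75 words" rule of thumb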

Special characters, such as “<>,.-”, often take one token each. So the more special characters you have, the more tokens you get compared to plain alphabetic text.
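A quick way to see this yourself (a sketch assuming the tiktoken package and the cl100k_base encoding; exact counts differ between encodings):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Ordinary words compress well; runs of punctuation or special characters
# tend to cost roughly one token per character.
for s in ("the quick brown fox", "<>,.-", "ff02::1 ip6-allnodes"):
    print(repr(s), len(enc.encode(s)))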

Well, you are picking a “corner-case sample to quibble about”, @smahm

You can see that in your example, the words are not typical words you find in text, so you are “picking” on a special corner-case.

Not sure what your point is. If you need an accurate token count, you should use the Python tiktoken library and get the exact number of tokens.

You are using a generalized rough guess method and applying it to a corner-case and then commenting on the lack of accuracy. Not sure why, to be honest.

Here is a “preferred method” to get tokens (chat completion API example using gpt-3.5-turbo):

import tiktoken
import sys

def tik(words):
    encoding = tiktoken.get_encoding("cl100k_base")
    #encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(words)
    return tokens

tmp = str(sys.argv[1])
# output token count 
print(len(tik(tmp)))
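If you save this as count_tokens.py (a filename I'm making up here), it can be run as:

python count_tokens.py "Hello world"

which prints the number of cl100k_base tokens for whatever string you pass as the first argument.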

Yeah, apologies, I’ll edit my post to be less quibbly. Thank you for the example; I will study the tiktoken library.

Two questions:

  1. Other documentation indicates the encoding_name for ChatGPT tokenizer is:

“gpt2” for tiktoken.get_encoding()

and

“text-davinci-003” for tiktoken.encoding_for_model(model)

What is “cl100k_base” and where is it referenced in the API documentation?

  2. Tiktoken with tiktoken.get_encoding("cl100k_base") was ~28 tokens off the count provided by a ChatGPT completion endpoint error message (which returns the total number of requested tokens, allowing me to compare the token counts). Is tiktoken the exact same tokenizer used by the endpoints, or just a very close approximation?

Thanks

I think OpenAI should provide an API endpoint for calculating tokens.

Give text input and model as parameters.
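Until something official exists, a self-hosted stand-in is only a few lines (a sketch assuming FastAPI plus tiktoken; the route and parameter names are made up and this is not an OpenAI endpoint):

import tiktoken
from fastapi import FastAPI

app = FastAPI()

@app.get("/token_count")  # hypothetical route
def token_count(text: str, model: str = "gpt-3.5-turbo"):
    enc = tiktoken.encoding_for_model(model)
    return {"model": model, "token_count": len(enc.encode(text))}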


This was under-reporting the tokens, and I actually asked GPT-4 what was wrong with the function. It immediately noticed that your method doesn’t account for punctuation. Here’s an updated version.

def estimate_tokens(text, method = "max")
  # method can be "average", "words", "chars", "max", "min", defaults to "max"
  # "average" is the average of words and chars
  # "words" is the word count divided by 0.75
  # "chars" is the char count divided by 4
  # "max" is the max of word and char
  # "min" is the min of word and char

  word_count = text.split(" ").count
  char_count = text.length
  tokens_count_word_est = word_count.to_f / 0.75
  tokens_count_char_est = char_count.to_f / 4.0

  # Include additional tokens for spaces and punctuation marks
  additional_tokens = text.scan(/[\s.,!?;]/).length

  tokens_count_word_est += additional_tokens
  tokens_count_char_est += additional_tokens

  output = 0
  if method == "average"
    output = (tokens_count_word_est + tokens_count_char_est) / 2
  elsif method == "words"
    output = tokens_count_word_est
  elsif method == "chars"
    output = tokens_count_char_est
  elsif method == 'max'
    output = [tokens_count_word_est, tokens_count_char_est].max
  elsif method == 'min'
    output = [tokens_count_word_est, tokens_count_char_est].min
  else
    # return invalid method message
    return "Invalid method. Use 'average', 'words', 'chars', 'max', or 'min'."
  end

  return output.to_i
end

For C#, there is a NuGet package to count tokens:

dotnet add package AI.Dev.OpenAI.GPT --version 1.0.2

source code:

In C#:

using System.Text.RegularExpressions;

public static int CountTokens(string input)
{
    string pattern = @"""(?:\\.|[^""\\])*""|'(?:[st]|re|ve|m|ll|d)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+";
    Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.Multiline);
    return regex.Matches(input).Count;
}

This uses the simpler regex provided by raymonddavey in “Do you get billed extra when echo=true - #4 by curt.kennedy”.

Can’t say it’s perfect, but for my purposes, it works.
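For comparison, the same idea in Python (a sketch; it needs the third-party regex package because the standard re module does not support \p{L}, and like the C# version it only approximates the real BPE token count):

import regex  # third-party package: pip install regex

GPT2_SPLIT = r"'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

def count_pieces(text):
    # Counts regex "pieces", a rough proxy for the token count.
    return len(regex.findall(GPT2_SPLIT, text))

print(count_pieces("You miss 100% of the shots you don't take"))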

Hi there!
It seems like you need a way to manually calculate the token count of a prompt in a non-Python programming language. I’ve developed a C++ library with a C export that can help you with this task, and there are a few examples of its use in different languages. The library can be used with various programming languages and should be helpful for your use case as well.

By using this library, you can calculate the token count for your prompts beforehand and make sure you don’t exceed the token limit for the model you’re using. I hope this helps! Please feel free to try it out and let me know if you have any questions or concerns. I’m here to help!

I think other people have more or less answered the question “How can I use existing tools to count tokens/tokenize?”, and great props to those people for their practical contributions. I do however think the thread’s titular question hasn’t been directly answered. Here’s my understanding from reading over the source code in tiktoken, though I admit I may have an imperfect understanding, as I’m an amateur in Rust, which is the language that does the heavy lifting for tiktoken’s encoder.

Suppose you call the following code

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("hello world")

In short, get_encoding looks up a constructor defined in the file tiktoken_ext/openai_public.py, which looks like this:

def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
    )
    special_tokens = {
        ENDOFTEXT: 100257,
        FIM_PREFIX: 100258,
        FIM_MIDDLE: 100259,
        FIM_SUFFIX: 100260,
        ENDOFPROMPT: 100276,
    }
    return {
        "name": "cl100k_base",
        "pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }

Notice the constructor returns pat_str, a regex that others here have mentioned, as well as two collections: special_tokens and mergeable_ranks. special_tokens is just a list of flags that get packaged with your query to help the language model understand where its job begins and ends, and these get handled separately. mergeable_ranks holds the contents of a tiktoken-specific formatted table which is loaded from the web. The rows in said table look like this:

IQ== 0
Ig== 1
Iw== 2
JA== 3
JQ== 4
Jg== 5
...
IGRldGVybWluaXN0aWM= 73449
IFVOSVFVRQ== 73450
IGV0dMOk 73451
U2luZ2xlTm9kZQ== 73452
...

As you can probably tell, these just assign some notion of precedence for which subsequences get merged into tokens first. These rankings are presumably learned in order to efficiently encode text into as few tokens as possible, but I imagine the learning algorithm is a guarded trade secret. In any case, judging by the uncommon-looking sequences further down the table, I bet that lower-ranked subsequences get merged first.
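For example, a row can be unpacked like this (a small sketch; the row is taken from the table excerpt above):

import base64

row = "IGRldGVybWluaXN0aWM= 73449"             # a line from cl100k_base.tiktoken
token_bytes, rank = row.split()
print(base64.b64decode(token_bytes), int(rank))  # b' deterministic' 73449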

Getting back to the flow of our function call, get_encoding then pushes the cl100k_base constructor down a pipeline which extracts these components and hands them to the “actual algorithm” which lives in a program compiled from src/lib.rs. Calling enc.encode("Hello World") will run this algorithm. After the special tokens are filtered out, we end up with the algorithm control flow being defined by the following function

fn _encode_ordinary_native(&self, text: &str) -> Vec<usize> {
    // This is the core of the encoding logic; the other functions in here
    // just make things complicated :-)
    let regex = self._get_tl_regex();
    let mut ret = vec![];
    for mat in regex.find_iter(text) {
        let piece = mat.unwrap().as_str().as_bytes();
        if let Some(token) = self.encoder.get(piece) {
            ret.push(*token);
            continue;
        }
        ret.extend(&byte_pair_encode(piece, &self.encoder));
    }
    ret
}

So this function first uses the regex to split the text into pieces. It then loops through those pieces. For each piece, it checks whether that piece is a predefined token, and if so, it appends that token to the end of ret, a token vector. If the piece is not a predefined token, the function passes it into byte_pair_encode, which will parse it into one or more tokens and append them to the end of ret.

pub fn byte_pair_encode(piece: &[u8], ranks: &HashMap<Vec<u8>, usize>) -> Vec<usize> {
    if piece.len() == 1 {
        return vec![ranks[piece]];
    }
    _byte_pair_merge(piece, ranks, |p| ranks[&piece[p.start..p.end]])
}

This function has two cases. If the piece has length one, it will return vec![ranks[piece]], basically indicating the piece comprises a single token, and its numerical “token value” is simply its rank from the table. Otherwise, it will return the value of _byte_pair_merge(piece, ranks, |p| ranks[&piece[p.start..p.end]]), which specifically parses the piece into multiple tokens. That function is pretty long and complicated, so I won’t paste it here, but as best as I can tell, it splits the piece into a vector of parts (initially individual bytes). Then it greedily merges adjacent parts into tokens based on the ranks, iteratively over many scans, until no further merges can be made.
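To make that concrete, here is a rough Python sketch of that greedy merge loop; it is my paraphrase of the behaviour described above, not tiktoken’s actual Rust implementation:

def bpe_merge_sketch(piece, ranks):
    # `piece` is a bytes object; `ranks` maps byte sequences to their rank
    # (lower rank = merged earlier). Assumes every single byte is in `ranks`,
    # which holds for the real vocabularies.
    parts = [bytes([b]) for b in piece]          # start with one part per byte
    while len(parts) > 1:
        # Find the adjacent pair with the lowest (most preferred) rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break                                # nothing left to merge
        # Merge that pair into a single part and scan again.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[p] for p in parts]             # map each final part to its token id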

TL;DR:

  1. Split string with regex into pieces.
  2. Some pieces get mapped to tokens immediately.
  3. Other pieces are split into byte pairs.
  4. OpenAI hands us a table of rankings for sequences of byte pairs. The lower the rank, the more we like the sequence.
  5. The byte pairs in a piece are merged according to that table, always choosing the lowest rank merge possible.
  6. Merged byte pairs are finally mapped to tokens.
  7. Now you have a vector of tokens. Count them if you want to.

Note that the tokenizer at OpenAI Platform is not correct for ChatGPT or the latest API. I didn’t realize this until just now.

For example, You miss 100% of the shots you don’t take:

  • Tokenizer with GPT-3 selected says 13 tokens:
    • You, miss, 100, %, of, the, shots, you, don, , , t, take
    • [1639, 2051, 1802, 4, 286, 262, 6934, 345, 836, 447, 247, 83, 1011]
  • Tiktoken with cl100k_base says 12 tokens:
    • You, miss, , 100, %, of, the, shots, you, don, ’t, take
    • [2675, 3194, 220, 1041, 4, 315, 279, 15300, 499, 1541, 1431, 1935]
    • This is the tokenizer for ChatGPT, gpt-3.5-turbo, and gpt-4.
  • Tiktoken with r50k_base says 13 tokens:
    • You, miss, 100, %, of, the, shots, you, don, , , t, take
    • [1639, 2051, 1802, 4, 286, 262, 6934, 345, 836, 447, 247, 83, 1011]
    • This is the tokenizer for GPT-3 models like davinci, etc.

(This forum appears to be stripping leading spaces and trailing spaces inside code tags.)
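The comparison can be reproduced with a few lines of tiktoken (a sketch; note the curly apostrophe in “don’t”, since a straight vs. curly quote changes the byte-level tokens):

import tiktoken

text = "You miss 100% of the shots you don’t take"
for name in ("cl100k_base", "r50k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(name, len(ids), ids)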

For example, if you look inside cl100k_base.tiktoken, and find the line:

IHNob3Rz 15300

it base64-decodes to " shots" (with a leading space), while r50k_base.tiktoken contains:

IHNob3Rz 6934

which is the same base64-encoded string.

I noticed this a while back. Any idea what tokenizer OpenAI’s tool is using? It says it’s the tokenizer for GPT-3, which should be either p50k_base or r50k_base, but I do not get the same token count when calculating tokens using tiktoken in Python (in a Google Colab notebook) as I do when I put the same text string into the OpenAI website.

For a given sample, I get 480 tokens from cl100k_base, 485 from either p50k or r50k, and around 503 from the website. That means the website seems to correspond to no known tokenizer encoding base supported by tiktoken, but it even shows you in delightful color-coded chunks what it’s doing to the text.

Very odd. It doesn’t really matter for me b/c I do a lot of token calculations prior to sending the API calls, but also generations are not of predictable length even with token limits, so I make sure there’s “padding” around my prompts so that token limits aren’t exceeded. Or, more recently I’m just using 3.5-turbo-16k but with about 5k input tokens, as that tends to yield the best results for my needs.

The “GPT-3” option should be r50k_base and the “Codex” option should be p50k_base. Both models are obsolete, though.

Note that p50k_base overlaps substantially with r50k_base, and for non-code applications, they will usually give the same tokens.
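You can check which encoding tiktoken associates with a given model name (a sketch; it assumes your tiktoken version still ships mappings for these legacy models):

import tiktoken

print(tiktoken.encoding_for_model("davinci").name)            # expected: r50k_base
print(tiktoken.encoding_for_model("code-davinci-002").name)   # expected: p50k_base
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)      # expected: cl100k_base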

Oh, weird. For what string?

I was testing with a vector store retrieval chunk after I realized the OpenAI website tokenizer always over-reports tokens by like 3% compared to what Tiktoken was giving me.

text = “1078\n\nFinally, the district court, and the dissent, place particular weight on a brief interchange during plaintiffs’ cross-examination of one of the NCAA’s witnesses, Neal Pilson, a television sports consultant formerly employed at CBS. Pilson testified that "if you’re paid for your performance, you’re not an amateur," and explained at length why paying students would harm the student-athlete market. Plaintiffs then asked Pilson whether his opinions about amateurism "depend on the level of the money" paid to players, and he acknowledged *1078 that his opinion was "impacted by the level." When asked whether there was a line that "should not be crossed" in paying players, Pilson responded "that’s a difficult question. I haven’t thought about the line. And I haven’t been asked to render an opinion on that." When pressed to come up with a figure, Pilson repeated that he was "not sure." He eventually commented that "I tell you that a million dollars would trouble me and $5,000 wouldn’t, but that’s a pretty good range." When asked whether deferred compensation to students would concern him, Pilson said that while he would not be as concerned by deferred payments, he would still be "troubled by it."[22]\n\nSo far as we can determine, Pilson’s offhand comment under cross-examination is the sole support for the district court’s $5,000 figure. But even taking Pilson’s comments at face value, as the dissent urges, his testimony cannot support the finding that paying student-athletes small sums will be virtually as effective in preserving amateurism as not paying them. Pilson made clear that he was not prepared to opine on whether pure cash compensation, of any amount, would affect amateurism. Indeed, he was never asked about the impact of giving student-athletes small cash payments; instead, like other witnesses, he was asked only whether big payments would be worse than small payments. Pilson’s casual comment \u2014 "[I] haven’t been asked to render an opinion on that. It’s not in my report" \u2014 that he would not be troubled by $5,000 payments is simply not enough to support the district court’s far-reaching conclusion that paying students $5,000 per year will be as effective in preserving amateurism as the NCAA’s current policy.[”

feeding just what’s between the “” to the tokenizer.

Yeah, for my prototyping, the only base that matters is cl100k_base. I’m not keeping precise token counts for Anthropic and Google PaLM 2, but I figure I’m within ~5% using the cl100k tokenizer.