What is the OpenAI algorithm to calculate tokens?

Say we put a sample /etc/hosts file into the tokenizer:

localhost	ha-laptop

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

It says this would parse to 75 tokens. The sample above has 189 characters, so using their estimate of one token ≈ 4 characters we would get ~47 tokens. If we count words instead, it has 22 words, so 22 / 0.75 ≈ 29 tokens.
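To make the arithmetic concrete, here is a minimal sketch of the two rough heuristics (the function names are my own; the 4-characters-per-token and 0.75-words-per-token figures are the averages quoted above, and only hold for typical English prose):

```python
def estimate_by_chars(text: str) -> int:
    """Rough estimate: ~4 characters per token."""
    return round(len(text) / 4)

def estimate_by_words(text: str) -> int:
    """Rough estimate: ~0.75 words per token."""
    return round(len(text.split()) / 0.75)

sample = "The quick brown fox jumps over the lazy dog"
print(estimate_by_chars(sample))  # 43 chars -> 11
print(estimate_by_words(sample))  # 9 words  -> 12
```

As the /etc/hosts example shows, the two estimates can disagree with each other and with the real tokenizer, especially on text that is not ordinary prose.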

Can anyone please help explain why this is?

Special characters, such as “<>,.-”, often take one token each. So the more special characters you have, the more tokens you get compared to plain alphabetic text.

Well, you are picking a “corner-case sample to quibble about”, @smahm

You can see that in your example, the words are not typical words you find in text, so you are “picking” on a special corner case.

Not sure what your point is. If you need an accurate token count, you should use the Python tiktoken library and get the exact number of tokens.

You are taking a generalized rough-guess method, applying it to a corner case, and then commenting on the lack of accuracy. Not sure why, to be honest.

Here is a “preferred method” to get tokens (chat completion API example using `gpt-3.5-turbo`):

import tiktoken
import sys

def tik(words):
    # cl100k_base is the encoding used by the chat models
    encoding = tiktoken.get_encoding("cl100k_base")
    # encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(words)
    return tokens

tmp = str(sys.argv[1])
# output token count
print(len(tik(tmp)))

Yeah, apologies, I’ll edit my post to be less quibbly. Thank you for the example; I will study the tiktoken library.

Two questions:

  1. Other documentation indicates the encoding_name for the ChatGPT tokenizer is:

“gpt2” for tiktoken.get_encoding()

“text-davinci-003” for tiktoken.encoding_for_model(model)

What is “cl100k_base” and where is it referenced in the API documentation?

  2. Tiktoken with tiktoken.get_encoding(“cl100k_base”) was ~28 tokens off the count reported by a ChatGPT completion endpoint error message (which returns the total number of requested tokens, letting me compare counts). Is tiktoken the exact same tokenizer used by the endpoints, or only a very close approximation?
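On question 1: cl100k_base is the encoding used by the gpt-3.5-turbo and gpt-4 chat models, while older names map to different encodings. A sketch of the mapping as I understand it (the table below is my own assumption based on tiktoken’s model registry; `tiktoken.encoding_for_model` is the authoritative lookup, so verify against your installed version):

```python
# Assumed model -> encoding mapping (mirrors tiktoken's registry;
# check tiktoken.encoding_for_model for the authoritative answer).
MODEL_TO_ENCODING = {
    "gpt2": "gpt2",
    "text-davinci-003": "p50k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-4": "cl100k_base",
}

def encoding_for(model: str) -> str:
    # Default to cl100k_base for unknown chat-era models (an assumption).
    return MODEL_TO_ENCODING.get(model, "cl100k_base")

print(encoding_for("gpt-3.5-turbo"))  # cl100k_base
```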

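On question 2, a likely explanation for the ~28-token gap: the chat completions endpoint counts extra formatting tokens per message (role markers and message delimiters) on top of the encoded content, so running tiktoken on the raw text alone undercounts. A sketch of that accounting, as described in OpenAI’s token-counting cookbook (the per-message constant of 4 is the figure given for gpt-3.5-turbo-0301 and may differ for other model versions; `fake_encode` is a stand-in for a real tiktoken encoder):

```python
def count_chat_tokens(messages, encode, tokens_per_message=4, reply_priming=3):
    """Approximate total prompt tokens for a chat completion request.

    encode: callable mapping text -> list of tokens
            (e.g. a tiktoken encoding's .encode method).
    tokens_per_message: formatting overhead per message (role markers,
            delimiters); 4 is the published figure for gpt-3.5-turbo-0301.
    reply_priming: every reply is primed with a few extra tokens.
    """
    total = 0
    for message in messages:
        total += tokens_per_message
        for value in message.values():
            total += len(encode(value))
    return total + reply_priming

# Stand-in encoder for illustration: one token per whitespace-separated word.
fake_encode = lambda text: text.split()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
print(count_chat_tokens(messages, fake_encode))  # 19
```

With several messages in a request, this overhead adds up quickly and could plausibly account for a gap of a few dozen tokens.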

I think OpenAI should provide an API endpoint for calculating tokens.

Give text input and model as parameters.
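No such endpoint exists today, so everything below (the request shape, field names, and response shape) is hypothetical. A sketch of what a handler behind such an endpoint might do, with a stand-in tokenizer:

```python
def handle_token_count(request: dict, tokenizers: dict) -> dict:
    """Hypothetical handler for a token-counting endpoint.

    request: {"model": ..., "input": ...}  -- invented shape, not a real API.
    tokenizers: maps model name -> encode callable.
    """
    model = request["model"]
    if model not in tokenizers:
        return {"error": f"unknown model: {model}"}
    tokens = tokenizers[model](request["input"])
    return {"model": model, "token_count": len(tokens)}

# Stand-in tokenizer: one token per word. A real server would use the
# model's actual BPE tokenizer (e.g. tiktoken's cl100k_base).
tokenizers = {"gpt-3.5-turbo": lambda text: text.split()}

resp = handle_token_count(
    {"model": "gpt-3.5-turbo", "input": "Hello world"}, tokenizers
)
print(resp)
```

Until then, running tiktoken locally (as in the script above) is the closest substitute.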