Say we put a sample /etc/hosts file into the tokenizer.
```
127.0.0.1 localhost
127.0.1.1 ha-laptop
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
```
It says this would parse to 75 tokens. The sample above has 189 characters, so their rule of thumb of one token per four characters gives 189 / 4 ≈ 47 tokens. Counting words instead, there are 22 words, so 22 / 0.75 ≈ 29 tokens, which is even further off.
Special characters such as “<>,.-” often take one token each, so the more special characters the text contains, the more tokens it produces compared to plain alphabetic text.
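You can check the exact count yourself with tiktoken. A quick sketch, assuming the web tokenizer page uses the GPT-3-era `gpt2`/`r50k_base` vocabulary (newer models use a different encoding):

```python
import tiktoken

# The /etc/hosts sample from above
hosts = """127.0.0.1 localhost
127.0.1.1 ha-laptop
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters"""

# Assumption: the web tokenizer used the GPT-2/GPT-3 vocabulary
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode(hosts)

print(f"{len(hosts)} chars -> {len(tokens)} tokens")
print(f"chars/4 estimate: {len(hosts) // 4}")
```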
Well, you are picking a “corner-case sample to quibble about”, @smahm
You can see that in your example the words are not typical words you find in text, so you are “picking” on a special corner case.
Not sure what your point is. If you need an accurate token count, you should use the Python tiktoken library and get the exact number of tokens.
You are using a generalized rough-guess method, applying it to a corner case, and then commenting on its lack of accuracy. Not sure why, to be honest.
Here is a “preferred method” to get tokens (a chat completion API example using `gpt-3.5-turbo`):
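A minimal sketch along the lines of the OpenAI cookbook's `num_tokens_from_messages` recipe; the per-message overhead constants are assumptions (they were documented for `gpt-3.5-turbo-0301` and can change between model versions):

```python
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
    """Estimate the tokens a chat completion request will consume."""
    encoding = tiktoken.encoding_for_model(model)
    # Assumption: every message carries ~4 tokens of framing
    # (<|start|>{role}\n{content}<|end|>\n) and every reply is
    # primed with ~3 tokens; these constants follow the cookbook
    # example for gpt-3.5-turbo-0301 and may differ elsewhere.
    tokens_per_message = 4
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for value in message.values():
            num_tokens += len(encoding.encode(value))
    num_tokens += 3  # reply priming
    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test."},
]
print(num_tokens_from_messages(messages))
```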
Other documentation indicates the encoding_name for the ChatGPT tokenizer is:
`gpt2` for `tiktoken.get_encoding()`
and
`text-davinci-003` for `tiktoken.encoding_for_model(model)`
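For reference, a quick sketch of the two lookup paths (assuming a recent tiktoken release; the encoding each call resolves to may differ across versions):

```python
import tiktoken

# Look up an encoding directly by its name
enc_by_name = tiktoken.get_encoding("gpt2")

# Look up the encoding registered for a particular model
enc_by_model = tiktoken.encoding_for_model("text-davinci-003")

# .name reveals which encoding you actually got back
print(enc_by_name.name, enc_by_model.name)
```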
What is “cl100k_base” and where is it referenced in the API documentation?
Tiktoken with tiktoken.get_encoding(“cl100k_base”) was ~28 tokens off the count reported by a ChatGPT completion endpoint error message (the error returns the total number of requested tokens, which let me compare the two counts). Is tiktoken the exact same tokenizer used by the endpoints, or only a very, very close approximation?