Say we put a sample /etc/hosts file into the tokenizer:
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
The tokenizer reports that this parses to 75 tokens. The sample has 189 characters, so the rule-of-thumb estimate of one token per four characters gives 189 / 4 ≈ 47 tokens. Counting words instead, the sample has 22 words, and at roughly 0.75 words per token that gives 22 / 0.75 ≈ 29 tokens. Neither estimate comes close to 75.
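The arithmetic above, spelled out as a quick sketch (the character and word counts are the ones quoted in the paragraph, not recomputed from the snippet):

```python
# Rule-of-thumb token estimates for the /etc/hosts sample.
chars, words = 189, 22

est_by_chars = chars // 4           # 1 token ~ 4 characters -> 47
est_by_words = round(words / 0.75)  # 1 token ~ 0.75 words   -> 29

print(est_by_chars, est_by_words)   # both far below the reported 75
```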
Other documentation indicates the encoding_name for the ChatGPT tokenizer is:
"gpt2" for tiktoken.get_encoding()
"text-davinci-003" for tiktoken.encoding_for_model(model)
What is "cl100k_base" and where is it referenced in the API documentation?
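For what it's worth, tiktoken itself maps the chat models to cl100k_base, so you can look the encoding up by model name rather than hard-coding it. A small sketch, assuming the third-party tiktoken package is installed (the except branch is a hard-coded fallback so the snippet still runs without it):

```python
# Look up the encoding by model name instead of guessing the encoding string.
try:
    import tiktoken
    name = tiktoken.encoding_for_model("gpt-3.5-turbo").name
except ImportError:
    name = "cl100k_base"  # documented mapping for gpt-3.5-turbo / gpt-4

print(name)
```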
Using tiktoken with tiktoken.get_encoding("cl100k_base"), my count was ~28 tokens off from the count reported by a ChatGPT completion endpoint error message (the error returns the total number of requested tokens, which let me compare counts). Is tiktoken the exact same tokenizer used by the endpoints, or only a very close approximation?
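One plausible (unconfirmed) explanation for the gap: the chat endpoints count not just your message text but also per-message formatting tokens and a reply-priming overhead, so encoding the raw strings alone undercounts. A sketch of that idea, assuming the third-party tiktoken package; the overhead values (4 per message, 3 priming) are assumptions borrowed from community examples, not documented exactly:

```python
# Estimate total request tokens for a chat completion, including assumed
# per-message overhead. Falls back to a rough chars/4 count without tiktoken.
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")
    count = lambda s: len(_enc.encode(s))
except ImportError:
    count = lambda s: max(1, len(s) // 4)  # crude fallback, illustration only

def chat_tokens(messages, per_message_overhead=4, priming=3):
    """Estimate tokens for a chat request.
    per_message_overhead and priming are assumed values; the exact
    numbers vary by model and are not precisely documented."""
    total = priming  # assumed tokens that prime the assistant's reply
    for m in messages:
        total += per_message_overhead
        total += count(m["role"]) + count(m["content"])
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "::1 localhost ip6-localhost ip6-loopback"},
]
print(chat_tokens(messages))
```

If the difference you see scales with the number of messages in the request, formatting overhead rather than a different tokenizer is the likely cause.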