I want to measure the token length of the inputs I'm sending to davinci-002 and babbage-002.
Which tiktoken encoding should I use for this?
When I try tiktoken.encoding_for_model('davinci-002'), it raises
KeyError: 'Could not automatically map davinci-002 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'
even with the latest tiktoken version.
Update: For anyone else wondering the same thing, I'm now pretty sure the encoding is cl100k_base.
I wrote this script to check the token lengths returned by the API for these models, on prompts whose token lengths differ between encodings:
# adapted from
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
import openai
import tiktoken
from pprint import pprint

example_strings = [
    "antidisestablishmentarianism",
    "2 + 2 = 4",
    "お誕生日おめでとう",
]

def get_encoding_lengths(example_string: str) -> dict:
    # Count tokens under each candidate encoding.
    results = {}
    for encoding_name in ["gpt2", "p50k_base", "cl100k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        results[encoding_name] = num_tokens
    return {example_string: results}

comparison_dict = {}
for example_string in example_strings:
    comparison_dict.update(get_encoding_lengths(example_string))
    # Ask the API itself how many prompt tokens each model sees.
    for model in [
        "babbage-002",
        "davinci-002",
    ]:
        response = openai.Completion.create(
            model=model,
            prompt=example_string,
            max_tokens=0,
            logprobs=0,
            echo=True,
        )
        comparison_dict[example_string][model] = response["usage"]["prompt_tokens"]

pprint(comparison_dict)
It prints
{'2 + 2 = 4': {'babbage-002': 7,
               'cl100k_base': 7,
               'davinci-002': 7,
               'gpt2': 5,
               'p50k_base': 5},
 'antidisestablishmentarianism': {'babbage-002': 6,
                                  'cl100k_base': 6,
                                  'davinci-002': 6,
                                  'gpt2': 5,
                                  'p50k_base': 5},
 'お誕生日おめでとう': {'babbage-002': 9,
                        'cl100k_base': 9,
                        'davinci-002': 9,
                        'gpt2': 14,
                        'p50k_base': 14}}
The token lengths for the two models match the cl100k_base token lengths, not the lengths from the other encodings.
All current models use cl100k_base.