Tokenizers for davinci-002 and babbage-002

I want to measure the token length of the inputs I'm sending to davinci-002 and babbage-002.

Which tiktoken encoding should I use for this?

When I try tiktoken.encoding_for_model('davinci-002'), it raises

KeyError: 'Could not automatically map davinci-002 to a tokeniser. Please use `tiktok.get_encoding` to explicitly get the tokeniser you expect.'

even with the latest tiktoken version.

Update: For anyone else wondering the same thing, I'm now pretty sure the encoding is cl100k_base.

I wrote this script to check the token lengths returned by the API for these models, on prompts whose token lengths differ between encodings:

# adapted from 
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

import openai
import tiktoken
from pprint import pprint

example_strings = [
    "antidisestablishmentarianism",
    "2 + 2 = 4",
    "ใŠ่ช•็”Ÿๆ—ฅใŠใ‚ใงใจใ†",
]

def get_encoding_lengths(example_string: str) -> dict:
    results = {}
    for encoding_name in ["gpt2", "p50k_base", "cl100k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        results[encoding_name] = num_tokens
    return {example_string: results}


comparison_dict = {}

for example_string in example_strings:
    comparison_dict.update(get_encoding_lengths(example_string))
    
    for model in [
        'babbage-002',
        'davinci-002'
    ]:
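        # Legacy Completions API (openai<1.0). With max_tokens=0 and echo=True
        # nothing is generated, so usage reports only the prompt's token count.
        response = openai.Completion.create(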
            model=model,
            prompt=example_string,
            max_tokens=0,
            logprobs=0,
            echo=True,
        )

        comparison_dict[example_string][model] = response['usage']['prompt_tokens']
        
pprint(comparison_dict)

It prints

{'2 + 2 = 4': {'babbage-002': 7,
               'cl100k_base': 7,
               'davinci-002': 7,
               'gpt2': 5,
               'p50k_base': 5},
 'antidisestablishmentarianism': {'babbage-002': 6,
                                  'cl100k_base': 6,
                                  'davinci-002': 6,
                                  'gpt2': 5,
                                  'p50k_base': 5},
 'ใŠ่ช•็”Ÿๆ—ฅใŠใ‚ใงใจใ†': {'babbage-002': 9,
               'cl100k_base': 9,
               'davinci-002': 9,
               'gpt2': 14,
               'p50k_base': 14}}

The token lengths for the two models match the cl100k_base token lengths, not the lengths from the other encodings.
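To make that comparison explicit rather than eyeballing the output, here is a small sketch that lists which encodings agree with the API's count on every prompt. The function name `matching_encodings` is just something I made up for illustration, and the dictionary is the printed result above copied in by hand.

```python
# Encodings that were compared locally in the script above.
ENCODING_NAMES = ("gpt2", "p50k_base", "cl100k_base")


def matching_encodings(comparison: dict, model: str) -> list:
    """Return the encodings whose local count equals the API count for every prompt."""
    return [
        name
        for name in ENCODING_NAMES
        if all(counts[name] == counts[model] for counts in comparison.values())
    ]


# Results copied from the printed output above.
comparison_dict = {
    "2 + 2 = 4": {
        "babbage-002": 7, "cl100k_base": 7, "davinci-002": 7,
        "gpt2": 5, "p50k_base": 5,
    },
    "antidisestablishmentarianism": {
        "babbage-002": 6, "cl100k_base": 6, "davinci-002": 6,
        "gpt2": 5, "p50k_base": 5,
    },
    "お誕生日おめでとう": {
        "babbage-002": 9, "cl100k_base": 9, "davinci-002": 9,
        "gpt2": 14, "p50k_base": 14,
    },
}

for model in ("babbage-002", "davinci-002"):
    print(model, matching_encodings(comparison_dict, model))
# babbage-002 ['cl100k_base']
# davinci-002 ['cl100k_base']
```

Three prompts is a small sample, so this is evidence rather than proof, but the prompts were chosen because the encodings disagree on them.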

All current models use cl100k_base.
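Building on that, a minimal sketch of a workaround for the KeyError: check a small override table first, and only fall back to tiktoken's own model mapping for anything else. `MANUAL_ENCODINGS`, `encoding_name_for`, and `count_tokens` are hypothetical names, and the cl100k_base entries rest on the measurements in this thread, not on an official mapping.

```python
# Hypothetical override table; entries are based on the API token counts
# measured in this thread, not on an official mapping.
MANUAL_ENCODINGS = {
    "davinci-002": "cl100k_base",
    "babbage-002": "cl100k_base",
}


def encoding_name_for(model: str) -> str:
    """Return the tiktoken encoding name to use for `model`."""
    if model in MANUAL_ENCODINGS:
        return MANUAL_ENCODINGS[model]
    # Imported lazily so the override table works without tiktoken installed.
    import tiktoken

    # Raises KeyError if tiktoken does not recognise the model either.
    return tiktoken.encoding_for_model(model).name


def count_tokens(text: str, model: str) -> int:
    """Count the tokens in `text` under the encoding chosen for `model`."""
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name_for(model))
    return len(encoding.encode(text))
```

With this, `count_tokens("2 + 2 = 4", "davinci-002")` should agree with the 7 prompt tokens the API reported above.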