Tokenizers for davinci-002 and babbage-002

rob-galileo · August 23, 2023, 3:35pm

I want to measure the token length of the inputs I’m sending to davinci-002 and babbage-002.

Which tiktoken encoding should I use for this?

When I try tiktoken.encoding_for_model('davinci-002'), it raises

KeyError: 'Could not automatically map davinci-002 to a tokeniser. Please use `tiktok.get_encoding` to explicitly get the tokeniser you expect.'

even with the latest tiktoken version.

rob-galileo · August 23, 2023, 4:48pm

Update: For anyone else wondering the same thing, I’m now pretty sure the encoding is cl100k_base.

I wrote this script to check the token lengths returned by the API for these models, on prompts whose token lengths differ between encodings:

# adapted from 
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

import openai
import tiktoken
from pprint import pprint

example_strings = [
    "antidisestablishmentarianism",
    "2 + 2 = 4",
    "お誕生日おめでとう",
]

def get_encoding_lengths(example_string: str) -> dict:
    """Return {example_string: {encoding_name: token_count}} for each encoding."""
    results = {}
    for encoding_name in ["gpt2", "p50k_base", "cl100k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        results[encoding_name] = num_tokens
    return {example_string: results}


comparison_dict = {}

for example_string in example_strings:
    comparison_dict.update(get_encoding_lengths(example_string))
    
    for model in [
        'babbage-002',
        'davinci-002'
    ]:
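        # Generate nothing (max_tokens=0) and echo the prompt back;
        # only usage.prompt_tokens is needed from the response.
        response = openai.Completion.create(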
            model=model,
            prompt=example_string,
            max_tokens=0,
            logprobs=0,
            echo=True,
        )

        comparison_dict[example_string][model] = response['usage']['prompt_tokens']
        
pprint(comparison_dict)

It prints

{'2 + 2 = 4': {'babbage-002': 7,
               'cl100k_base': 7,
               'davinci-002': 7,
               'gpt2': 5,
               'p50k_base': 5},
 'antidisestablishmentarianism': {'babbage-002': 6,
                                  'cl100k_base': 6,
                                  'davinci-002': 6,
                                  'gpt2': 5,
                                  'p50k_base': 5},
 'お誕生日おめでとう': {'babbage-002': 9,
               'cl100k_base': 9,
               'davinci-002': 9,
               'gpt2': 14,
               'p50k_base': 14}}

The token lengths for the two models match the cl100k_base token lengths, not the lengths from the other encodings.

anon22939549 · August 23, 2023, 4:57pm

All current models use cl100k_base.
