After the release of gpt-4o, I found that it uses a new tokenization algorithm. So what is the new tokenization algorithm for gpt-4o?
The encoding name of the tokenizer corresponding to gpt-4o seems to be “o200k_base”.
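You can confirm this yourself with a minimal check in the tiktoken library (assuming a recent enough version that already ships the gpt-4o mapping):

    import tiktoken

    # Resolve the encoding by model name (the gpt-4o -> o200k_base
    # mapping was added in tiktoken 0.7.0).
    enc = tiktoken.encoding_for_model("gpt-4o")
    print(enc.name)  # "o200k_base"

    # Or load the encoding directly by name.
    enc = tiktoken.get_encoding("o200k_base")
    print(len(enc.encode("hello world")))  # token count for a sample string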
Thanks.
I’ve found the encoding file: https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
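For anyone curious about the file itself: as far as I can tell it is a plain-text BPE rank file, one base64-encoded byte sequence and its integer rank per line. A small sketch to load it (the filename is just wherever you saved the download):

    import base64

    # Parse a downloaded .tiktoken file into a {bytes: rank} mapping.
    # Assumes the usual tiktoken format: "<base64 token> <rank>" per line.
    with open("o200k_base.tiktoken") as f:
        ranks = {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in f if line.strip())
        }
    print(len(ranks))  # vocabulary size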
For those who need it, I’ve updated the OpenAI cookbook here:
And the usage function…
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
        "gpt-4-turbo",
        "gpt-4-turbo-2024-04-09",
        "gpt-4o",
        "gpt-4o-2024-05-13",
    }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4o" in model:
        print("Warning: gpt-4o may update over time. Returning num tokens assuming gpt-4o-2024-05-13.")
        return num_tokens_from_messages(messages, model="gpt-4o-2024-05-13")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
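For example, a quick sanity check against a short conversation (the exact counts depend on the model’s message framing, so treat the output as an estimate):

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How many tokens is this?"},
    ]
    print(num_tokens_from_messages(messages, model="gpt-4o"))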
What do you mean “the new algorithm”?! We do NOT know what the old one was, and OpenAI surely will not document this.
IF you are asking just about the tokenization of a string, that should be straightforward. Just get the tokens and chop away!
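For instance, a minimal truncation sketch with tiktoken (the model name here is just an example):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")
    tokens = enc.encode("some long input text ...")
    # Keep only the first 100 tokens and turn them back into a string.
    truncated = enc.decode(tokens[:100])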
But if you ACTUALLY want to do something a tiny bit more complex (like functions/tools), you’re out of luck.
Nobody really knows how they do it. OpenAI themselves don’t really know.
That’s why people have been trying (successfully) to reverse-engineer the algorithm that tokenizes the content of a full HTTP request.