After the release of gpt-4o, I found that it uses a new tokenization algorithm. So what is the new tokenization algorithm for gpt-4o?
The encoding name of the tokenizer corresponding to gpt-4o seems to be “o200k_base”.
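You can confirm this yourself with a minimal check in the tiktoken library (assuming a recent enough version that already ships the gpt-4o mapping):

    import tiktoken

    # Resolve the encoding by model name (the gpt-4o -> o200k_base
    # mapping was added in tiktoken 0.7.0).
    enc = tiktoken.encoding_for_model("gpt-4o")
    print(enc.name)  # "o200k_base"

    # Or load the encoding directly by name.
    enc = tiktoken.get_encoding("o200k_base")
    print(len(enc.encode("hello world")))  # token count for a sample string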
Thanks.
I’ve found the encoding file: https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
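For anyone curious about the file itself: as far as I can tell it is a plain-text BPE rank file, one base64-encoded byte sequence and its integer rank per line. A small sketch to load it (the filename is just wherever you saved the download):

    import base64

    # Parse a downloaded .tiktoken file into a {bytes: rank} mapping.
    # Assumes the usual tiktoken format: "<base64 token> <rank>" per line.
    with open("o200k_base.tiktoken") as f:
        ranks = {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in f if line.strip())
        }
    print(len(ranks))  # vocabulary size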
For those who need it, I’ve updated the OpenAI cookbook here:
And the usage function…
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
        "gpt-4-turbo",
        "gpt-4-turbo-2024-04-09",
        "gpt-4o",
        "gpt-4o-2024-05-13",
    }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4o" in model:
        print("Warning: gpt-4o may update over time. Returning num tokens assuming gpt-4o-2024-05-13.")
        return num_tokens_from_messages(messages, model="gpt-4o-2024-05-13")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
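For example, a quick sanity check against a short conversation (the exact counts depend on the model’s message framing, so treat the output as an estimate):

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How many tokens is this?"},
    ]
    print(num_tokens_from_messages(messages, model="gpt-4o"))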
What do you mean “the new algorithm”?! We do NOT know what the old one was, and OpenAI surely will not document this.
IF you are asking just about the tokenization of a string, that should be straightforward. Just get the tokens and chop away!
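For instance, a minimal truncation sketch with tiktoken (the model name here is just an example):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")
    tokens = enc.encode("some long input text ...")
    # Keep only the first 100 tokens and turn them back into a string.
    truncated = enc.decode(tokens[:100])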
But if you ACTUALLY want to do something a tiny bit more complex (like functions/tools), you’re out of luck.
Nobody really knows how they do it. OpenAI themselves don’t really know.
That’s why people have been trying (successfully) to reverse-engineer the algorithm that tokenizes the content of a full HTTP request.