Tokenizer - Latest ChatGPT Models

Hi everyone,

I am looking at the Tokenizer page on the OpenAI Website: https://platform.openai.com/tokenizer, super useful tool.

I can see that I can select models up to GPT-4o and 4o mini. Does anyone know whether this still applies to the latest models, or whether there is a newer page that lists the current models?

Thanks!

Oh, I thought it was removed. Didn’t know it still exists.

Hi @ollie266,

Thanks for taking the time to report this. It seems that, currently, it only applies to the models listed there plus any other models that use the o200k_base encoding, which is what gpt-4o uses.

I will keep this post updated in case of any developments.

Every chat model from gpt-4o-2024-05-13 through the latest has used o200k_base. So “gpt-4o” is the choice to pick on OpenAI’s page, unless you are using a model on the former cl100k_base encoding (gpt-4-turbo-2024-04-09 and earlier).

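# Choose the tiktoken encoding name for a given model name.
model = "gpt-4o"  # example model name

token_encoder = (
  "cl100k_base"
  if (
    model == "gpt-4"
    # strip the fine-tune prefix; "gpt-4-" also covers "gpt-4-turbo"
    or model.removeprefix("ft:").startswith(
      ("gpt-3", "gpt-4-")
    )
  )
  else "o200k_base"
)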

“text-embedding…” models use cl100k_base for measuring how much you can send, at least until a newer model is released.
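If you want to check counts locally rather than on the web page, here is a minimal sketch that wraps the rule above in a function and also covers the embedding models. The function names encoding_name_for_model and count_tokens are my own, not part of any library, and the prefix rule is only the heuristic from this post, so it may not hold for models released later. The count_tokens helper assumes the third-party tiktoken package is installed (Python 3.9+ for removeprefix).

```python
def encoding_name_for_model(model: str) -> str:
    """Guess the encoding name from a model name (heuristic from this thread)."""
    base = model.removeprefix("ft:")  # fine-tuned models carry an "ft:" prefix
    if base.startswith("text-embedding"):
        return "cl100k_base"
    if base == "gpt-4" or base.startswith(("gpt-3", "gpt-4-")):
        return "cl100k_base"
    # Assumption: everything else (gpt-4o and newer chat models) uses o200k_base.
    return "o200k_base"


def count_tokens(text: str, model: str) -> int:
    """Count tokens with tiktoken; imported lazily so the rule works without it."""
    import tiktoken  # third-party: pip install tiktoken

    encoding = tiktoken.get_encoding(encoding_name_for_model(model))
    return len(encoding.encode(text))


if __name__ == "__main__":
    for name in ("gpt-4o", "gpt-4-turbo-2024-04-09", "text-embedding-3-small"):
        print(name, "->", encoding_name_for_model(name))
```

Calling count_tokens("some text", "gpt-4o") should then match what the Tokenizer page reports for plain text. It does not account for the extra tokens that the chat format adds around each message.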

This alternate site, linked here, makes the name of the token encoder clear and also shows token numbers (except that the numbers it shows for special tokens used internally are supposedly wrong).

Quick follow-up: the Tokenizer page is now updated to support the latest models available through the API.

What happened is that the first tab was renamed so that it no longer shows gpt-4o.