What is the OpenAI algorithm to calculate tokens?

I also reverse engineered the tokenizer on the playground

It is written in JavaScript. I converted it to C#.


Why not post it and share?

That would be helpful, I think.

Sharing code in a developer community helps everyone in the community.

:slight_smile:

Thank you, it surely helps, but I need an exact count :confused:

But thank you, you’ve been a champ … :slight_smile:

THAT would be SUPER interesting, as in getting the C# version, assuming it’s 100% “perfect” in its calculations :slight_smile:

I’m happy to release the C# version.

There are differences that I have not nailed down, even though I am using the exact same token libraries. Another member gave me a suggestion, but I never got time to follow through with it.

I’ll extract it from my main app and package it up for others to use soon. (It’s embedded in my larger app at the moment)

Thank you!

We need a new OpenAI community rule, which I plan to propose to @logankilpatrick: all Q&A should take place in the community and not via email or private direct messages. We should also encourage everyone to post code, because this is a developer community and the language of developers is code.

This type of rule, that all Q&A should occur in the public forums, is normal for tech communities, because the purpose of having a public community is to share code and work on developer solutions in public, creating a knowledge base for current and future users.


I don’t think you will find any tokenizer which is “100%” perfect at counting tokens all the time, @polterguy. I write a lot of code, and I find the token approximations work fine, especially because you cannot control exactly how the GPT model will reply or exactly how many tokens it will use in a reply.

Of course, I would love to see code which does not use an API to estimate GPT-2 token count. I prefer to estimate tokens without making another network API call, because the more network API calls made, the greater the chance of errors, costs, and delays.

According to the GPT-3 docs I have read, using a GPT-2 tokenizer is also an approximation for GPT-3 (not exact) and the most recommended way to get a GPT-3 token count is to submit an API call to a GPT-3 endpoint.

Hence, that is why OpenAI offers the “rules to estimate” token count, I think at least.
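
For a rough estimate without any library, those documented rules of thumb (about 4 characters per token, or about 3/4 of a word per token) take only a few lines of Python; treat this as an approximation only, not an exact count:

# Rough offline token estimate using the rule-of-thumb ratios:
# ~4 characters per token, or ~0.75 words per token.
# Approximation only; real tokenizers differ, especially on code and punctuation.
def estimate_tokens(text):
    by_chars = len(text) / 4.0             # ~1 token per 4 characters
    by_words = len(text.split()) / 0.75    # ~1 token per 0.75 words
    return round((by_chars + by_words) / 2)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))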

HTH

Here is the correct link

Previous post deleted to avoid confusion

I couldn’t agree more. I’ve got 10K commits to GitHub the last 5 years or so, and everything we’re doing in the AI space is 100% Open Sauce ==> GitHub - polterguy/magic: Generate a web app in seconds with Low-Code and AI

So I totally agree!

OpenAI tokenizer

Respect!

I am very biased toward coders and wish more people here would “speak” through code and working examples!

After all, this is supposed to be a community of developers / coders …

Say we put a sample /etc/hosts file into the tokenizer.

127.0.0.1	localhost
127.0.1.1	ha-laptop

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

It says this would parse to 75 tokens. The sample above has 189 characters, meaning if we take their estimate of tokens = chars / 4 we would get about 47 tokens. If we look at the words, we find it has 22 words, so 22 / 0.75 ≈ 29 tokens.

Can anyone please help explain why this is?

Special characters, such as “<>,.-”, often take one token each. So the more special characters you have, the more tokens you get compared to plain alphabetic text.
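
A quick way to see this with tiktoken (a sketch; assumes the package is installed and uses the GPT-2 style encoding, which as far as I know is what the web tokenizer page reflected):

import tiktoken

# Compare punctuation-heavy hosts-file lines with plain prose of similar length.
enc = tiktoken.get_encoding("gpt2")

samples = [
    "127.0.0.1\tlocalhost",
    "ff02::1 ip6-allnodes",
    "The quick brown fox jumps over the dog",
]
for text in samples:
    print(f"{len(text):>2} chars -> {len(enc.encode(text)):>2} tokens  {text!r}")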

Well, you are picking a “corner-case sample to quibble about”, @smahm

You can see that in your example, the words are not typical words you find in text, and so you are “picking” on a special corner case.

Not sure what your point is. If you need an accurate token count, you should use the Python tiktoken library and get the exact number of tokens.

You are using a generalized rough guess method and applying it to a corner-case and then commenting on the lack of accuracy. Not sure why, to be honest.

Here is a “preferred method” to get tokens (chat completion API example using turbo):

import tiktoken
import sys

def tik(words):
    encoding = tiktoken.get_encoding("cl100k_base")
    #encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(words)
    return tokens

tmp = str(sys.argv[1])
# output token count 
print(len(tik(tmp)))

Yeah, apologies, I’ll edit my post to be less quibbly. Thank you for the example; I will study the tiktoken library.

Two questions:

  1. Other documentation indicates the encoding_name for ChatGPT tokenizer is:

“gpt2” for tiktoken.get_encoding()

and

“text-davinci-003” for tiktoken.encoding_for_model(model)

What is “cl100k_base” and where is it referenced in the API documentation?

  2. Tiktoken with tiktoken.get_encoding(“cl100k_base”) was ~28 tokens off the count provided by a ChatGPT completion endpoint error message (which returns the total number of requested tokens, allowing me to compare the counts). Is tiktoken the exact same tokenizer used by the endpoints, or a very, very close approximation? (See the sketch below.)
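
For reference, this is the kind of quick comparison I am running (encoding names are the ones tiktoken itself exposes; I may be misusing it):

import tiktoken

text = "127.0.0.1\tlocalhost ip6-localhost ip6-loopback"

# Same string, different encodings: the GPT-2/davinci-era encodings and
# cl100k_base (used by the chat models) give different counts.
for name in ("gpt2", "p50k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name:<12} {len(enc.encode(text))} tokens")

# encoding_for_model maps a model name to its encoding:
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)     # cl100k_base
print(tiktoken.encoding_for_model("text-davinci-003").name)  # p50k_base

# Note: chat completion requests also add a few formatting tokens per message,
# which plain encode() on the raw text does not count.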

Thanks

I think OpenAI should provide an API endpoint for calculating tokens.

Give text input and model as parameters.
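
In the meantime, a minimal local stand-in can be sketched with Flask plus tiktoken; the route and field names below are made up purely for illustration, this is not an official OpenAI endpoint:

# Illustrative local token-counting endpoint (not an official OpenAI API).
# POST /tokens with JSON {"model": "...", "text": "..."} returns a token count.
from flask import Flask, request, jsonify
import tiktoken

app = Flask(__name__)

@app.post("/tokens")
def count_tokens():
    data = request.get_json(force=True)
    enc = tiktoken.encoding_for_model(data["model"])
    return jsonify(model=data["model"], tokens=len(enc.encode(data["text"])))

if __name__ == "__main__":
    app.run(port=5000)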


This was under-reporting the tokens, and I actually asked GPT-4 what was wrong with the function. It immediately noticed that your method doesn’t account for punctuation. Here’s an updated version.

def estimate_tokens(text, method = "max")
  # method can be "average", "words", "chars", "max", "min", defaults to "max"
  # "average" is the average of words and chars
  # "words" is the word count divided by 0.75
  # "chars" is the char count divided by 4
  # "max" is the max of word and char
  # "min" is the min of word and char

  word_count = text.split(" ").count
  char_count = text.length
  tokens_count_word_est = word_count.to_f / 0.75
  tokens_count_char_est = char_count.to_f / 4.0

  # Include additional tokens for spaces and punctuation marks
  additional_tokens = text.scan(/[\s.,!?;]/).length

  tokens_count_word_est += additional_tokens
  tokens_count_char_est += additional_tokens

  output = 0
  if method == "average"
    output = (tokens_count_word_est + tokens_count_char_est) / 2
  elsif method == "words"
    output = tokens_count_word_est
  elsif method == "chars"
    output = tokens_count_char_est
  elsif method == 'max'
    output = [tokens_count_word_est, tokens_count_char_est].max
  elsif method == 'min'
    output = [tokens_count_word_est, tokens_count_char_est].min
  else
    # return invalid method message
    return "Invalid method. Use 'average', 'words', 'chars', 'max', or 'min'."
  end

  return output.to_i
end

For C#, there is a NuGet package to count tokens:

dotnet add package AI.Dev.OpenAI.GPT --version 1.0.2

source code:

In C#:
public static int CountTokens(string input)
{
    string pattern = @"/""(?:\\.|[^""\\])*""|'(?:[st]|re|ve|m|ll|d)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+";
    Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.Multiline);
    return regex.Matches(input).Count;
}

This uses the simpler regex provided by raymonddavey: Do you get billed extra when echo=true - #4 by curt.kennedy

Can’t say it’s perfect, but for my purposes, it works.
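
For anyone wanting the same approach in Python, the split can be sketched with the third-party regex module (the standard re module does not support \p{L} / \p{N}); like the C# version, it approximates the older GPT-2/GPT-3 encodings rather than cl100k_base:

# Count matches of the GPT-2 pre-tokenization pattern as a token approximation.
# Requires: pip install regex   (stdlib "re" lacks \p{L} / \p{N} support).
import regex

GPT2_SPLIT = regex.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
)

def count_tokens(text):
    # Each pre-tokenization match is treated as one token, mirroring CountTokens above.
    return len(GPT2_SPLIT.findall(text))

print(count_tokens("127.0.0.1\tlocalhost"))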