What is the OpenAI algorithm to calculate tokens?

polterguy · February 12, 2023, 10:35am

I know you have libraries helping out in Python and such, but I’m using an “esoteric” programming language to interact with the API, and I need to know how I can manually calculate how many tokens a prompt will result in. I’ve tried length of string and divide by four, and it doesn’t work. I’ve tried byte count (to accommodate for UTF8) and it doesn’t work.

Does anyone here know the algorithm to calculate prompt length?

My reason for asking is that I sometimes have large prompts, and I need to figure out max_tokens when I invoke the prompt HTTP endpoint, which of course cannot exceed the token limit for whatever model I am using.
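For reference, the arithmetic behind that question can be sketched in Ruby as below. This is only an illustration: the helper name max_tokens_budget, the safety_margin parameter, and the example numbers (a 4097-token context window, a 1200-token prompt) are assumptions, not values from this thread. The prompt count can come from an exact tokenizer or from an estimate.

```ruby
# Hypothetical helper: how many completion tokens can still be requested
# once the prompt has been counted (or estimated).
def max_tokens_budget(prompt_tokens, context_limit, safety_margin: 0)
  remaining = context_limit - prompt_tokens - safety_margin
  raise ArgumentError, "prompt does not fit in the context window" if remaining <= 0
  remaining
end

# Example values are assumptions: a 4097-token context window, a prompt of
# 1200 tokens, and 50 tokens of headroom in case the count is an estimate.
puts max_tokens_budget(1200, 4097, safety_margin: 50)  # prints 2847
```

A larger safety margin makes sense when the prompt count is only an estimate, since an undercount would push the request over the model limit.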

ruby_coder · February 12, 2023, 10:38am

Here ya go @polterguy

https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

Hope this helps!

polterguy · February 13, 2023, 4:20am

Thx @ruby_coder - It helps me understand tokens, but it doesn’t provide the algorithm. I need to know exactly how many tokens a specified string generates in the model, and I am not using NodeJS or Python.

ruby_coder · February 13, 2023, 4:26am

Hi @polterguy

Following up, I wrote this Ruby method (based on the reference above) which you might be able to convert to your favorite programming language.

Method to estimate tokens per OpenAI docs with method options

def self.estimate_tokens(text, method = "max")
  # method can be "average", "words", "chars", "max", "min"; defaults to "max"
  # "average" is the average of the word and char estimates
  # "words" is the word count divided by 0.75
  # "chars" is the char count divided by 4
  # "max" is the max of the word and char estimates
  # "min" is the min of the word and char estimates
  word_count = text.split(" ").count
  char_count = text.length
  tokens_count_word_est = word_count.to_f / 0.75
  tokens_count_char_est = char_count.to_f / 4.0
  output = 0
  if method == "average"
    output = (tokens_count_word_est + tokens_count_char_est) / 2
  elsif method == "words"
    output = tokens_count_word_est
  elsif method == "chars"
    output = tokens_count_char_est
  elsif method == "max"
    output = [tokens_count_word_est, tokens_count_char_est].max
  elsif method == "min"
    output = [tokens_count_word_est, tokens_count_char_est].min
  else
    # return invalid method message
    return "Invalid method. Use 'average', 'words', 'chars', 'max', or 'min'."
  end
  return output.to_i
end

Examples

text="Curie is extremely powerful, yet very fast. While Davinci is stronger when it comes to analyzing complicated text, Curie is quite capable for many nuanced tasks like sentiment classification and summarization. Curie is also quite good at answering questions and performing Q&A and as a general service chatbot."

estimate_tokens(text, "words")
=> 64

estimate_tokens(text, "chars")
=> 77

estimate_tokens(text, "average")
=> 70

estimate_tokens(text, "max")
=> 77

estimate_tokens(text, "min")
=> 64

estimate_tokens(text)
=> 77

estimate_tokens(text, "bonehead")
=> "Invalid method. Use 'average', 'words', 'chars', 'max', or 'min'."

Modify and improve as you wish!

Hope this helps.

ruby_coder · February 13, 2023, 4:28am

Yes, it does. These are algorithms, per the document: algorithms to estimate token count.

1 token ~= 4 chars in English
1 token ~= ¾ words
100 tokens ~= 75 words

In the method I posted above (to help you @polterguy) I only used two criteria:

1 token ~= 4 chars in English
1 token ~= ¾ words

You can modify as you like.

HTH

Note:

“An algorithm is a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.”

These are OpenAI GPT “rules” to estimate token count, per the doc provided above, so they are a type of algorithm, and they are what I use in the code I wrote during breakfast to assist you.

1 token ~= 4 chars in English
1 token ~= ¾ words

raymonddavey · February 13, 2023, 4:51am

There is a much more accurate way: using the GPT-2 tokenizer library. I’ll find the link. Then you will get close to the same count that OpenAI uses, with no need to estimate.

ruby_coder · February 13, 2023, 4:52am

Also note, if you take the examples in the OpenAI docs:

Wayne Gretzky’s quote “You miss 100% of the shots you don’t take” contains 11 tokens.

So, let:

 text="You miss 100% of the shots you don't take"

estimate_tokens(text,"average")
=> 11

estimate_tokens(text,"min")
=> 10

estimate_tokens(text,"max")
=> 12

So, one could think, for this very small sample of text, that “average” is closer to how OpenAI get a final count.

Of course, more trials and tests are needed, but I think “average” is a good guess for many developers; if you want to be conservative, use “max”.
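One way to run those extra trials is a small comparison harness, sketched below. The names token_estimates and REFERENCE_SAMPLES are hypothetical, and the only reference count included is the Gretzky quote (11 tokens, per the OpenAI docs quoted above); more samples would need counts from a trusted tokenizer.

```ruby
# Hypothetical helper: the "min", "average" and "max" estimates in one hash,
# using the same 0.75 words-per-token and 4 chars-per-token rules as above.
def token_estimates(text)
  by_words = text.split(" ").count / 0.75
  by_chars = text.length / 4.0
  {
    "min" => [by_words, by_chars].min.to_i,
    "average" => ((by_words + by_chars) / 2).to_i,
    "max" => [by_words, by_chars].max.to_i,
  }
end

# Reference samples: text => known token count.
# Add more pairs here as trusted counts become available.
REFERENCE_SAMPLES = {
  "You miss 100% of the shots you don't take" => 11,
}.freeze

REFERENCE_SAMPLES.each do |text, actual|
  token_estimates(text).each do |method, estimate|
    puts format("%-8s estimate=%d actual=%d error=%+d",
                method, estimate, actual, estimate - actual)
  end
end
```

For the single sample here the errors are -1, 0 and +1 for “min”, “average” and “max”, which matches the numbers above; one sample is far too little to pick a winner.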

Hope this helps.

ruby_coder · February 13, 2023, 4:54am

Hopefully, the link you plan to provide has code which developers can download / copy and use in their projects.

If so, I will add it and test.

Last time I checked, the GPT2 tokenizer was not “perfectly accurate” either; but that was using a third-party on-line form, which I cannot include in code as a developer.

raymonddavey · February 13, 2023, 4:56am

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

How to count tokens with tiktoken

tiktoken (https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.

Given a text string (e.g., "tiktoken is great!") and an encoding (e.g., "gpt2"), a tokenizer can split the text string into a list of tokens (e.g., ["t", "ik", "token", " is", " great", "!"]).

Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.

tiktoken supports three encodings used by OpenAI models:

Encoding name | OpenAI models
gpt2 (or r50k_base) | Most GPT-3 models

This file has been truncated.
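For anyone porting this to another language: tokenizers of this family use byte pair encoding (BPE), where text is broken into small pieces and adjacent pieces are merged according to a learned, ranked merge table. The Ruby sketch below shows only that merge loop. TOY_MERGE_RANKS and toy_bpe are hypothetical, and the table was hand-picked so that "tiktoken" splits the way the notebook excerpt shows; real tokenizers work on bytes after a regex pre-split and need the model's actual vocabulary files, so this toy will not reproduce real counts.

```ruby
# Toy merge table (hypothetical): a lower rank means the pair is merged earlier.
# Real tokenizers load tens of thousands of learned merges from vocabulary files.
TOY_MERGE_RANKS = {
  ["t", "o"] => 0,
  ["k", "e"] => 1,
  ["to", "ke"] => 2,
  ["toke", "n"] => 3,
  ["i", "k"] => 4,
}.freeze

# Repeatedly merge the adjacent pair with the lowest rank until no
# adjacent pair appears in the merge table.
def toy_bpe(word, ranks = TOY_MERGE_RANKS)
  parts = word.chars
  loop do
    best_index = nil
    best_rank = nil
    parts.each_cons(2).with_index do |pair, i|
      rank = ranks[pair]
      next if rank.nil?
      if best_rank.nil? || rank < best_rank
        best_rank = rank
        best_index = i
      end
    end
    break if best_index.nil?
    # Replace the two pieces with their concatenation.
    parts[best_index, 2] = parts[best_index] + parts[best_index + 1]
  end
  parts
end

p toy_bpe("tiktoken")          # => ["t", "ik", "token"]
puts toy_bpe("tiktoken").length  # token count for this toy table: 3
```

The point of the sketch is that an exact count depends on the merge table, not on a formula over string length, which is why dividing by four can only ever be an estimate.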

raymonddavey · February 13, 2023, 4:57am

I also reverse engineered the tokenizer on the playground.

It is written in JavaScript. I converted it to C#.

ruby_coder · February 13, 2023, 4:59am

Why not post it and share?

That would be helpful, I think.

Sharing code in a developer community helps everyone in the community.

polterguy · February 13, 2023, 5:15am

Thank you, it surely helps, but I need an exact count.

But thank you, you’ve been a champ …

polterguy · February 13, 2023, 5:17am

THAT would be SUPER interesting, as in getting the C# version - Assuming it’s 100% “perfect” in its calculations

raymonddavey · February 13, 2023, 5:26am

I’m happy to release the C# version.

There are differences that I have not nailed down. I am using the exact same token libraries. Another member gave me a suggestion that I never got time to follow through with.

I’ll extract it from my main app and package it up for others to use soon. (It’s embedded in my larger app at the moment)

ruby_coder · February 13, 2023, 5:34am

Thank you!

We need a new OpenAI community rule, which I plan to propose to @logankilpatrick that all Q&A should take place in the community and not via email or private, direct messages; and we need to encourage everyone to post code because this is a developer community and the language of developers is code.

This type of “Q and A” tech forum rule that all Q & A should occur in the public forums is normal for tech communities because the purpose of having a public community is to share code and work on developer solutions in public to create a knowledge base for current and future users.

ruby_coder · February 13, 2023, 5:37am

I don’t think you will find any tokenizer which is “100%” perfect at counting tokens all the time, @polterguy. I write a lot of code, and I find the token approximations work fine, especially because you cannot control exactly how the GPT model will reply or exactly how many tokens it will use in a reply.

Of course, I would love to see code which does not use an API to estimate GPT-2 token count, because I prefer to estimate tokens without making another network API call because the more network API calls made, the more the chance for errors, costs and delays.

According to the GPT-3 docs I have read, using a GPT-2 tokenizer is also an approximation for GPT-3 (not exact) and the most recommended way to get a GPT-3 token count is to submit an API call to a GPT-3 endpoint.

Hence, that is why OpenAI offers the “rules to estimate” token count, I think at least.
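If you do go the API route, the completion response reports the counts in its usage field, so the prompt count can be read back after a call. A minimal Ruby sketch, assuming you already have the raw JSON response body as a string; the helper name reported_prompt_tokens and the sample body are illustrative, not captured from a real call:

```ruby
require "json"

# Reads the prompt token count that the API reports in a response body.
# Returns nil when the body carries no usage information.
def reported_prompt_tokens(response_body)
  usage = JSON.parse(response_body)["usage"]
  usage && usage["prompt_tokens"]
end

# Illustrative response fragment (made-up numbers, not a real API response).
sample_body = '{"usage": {"prompt_tokens": 11, "completion_tokens": 7, "total_tokens": 18}}'
puts reported_prompt_tokens(sample_body)  # prints 11
```

This only tells you the count after the request has been sent, so it is useful for calibrating an estimator rather than for choosing max_tokens up front.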

HTH

raymonddavey · February 13, 2023, 5:56am

Here is the correct link

Previous post deleted to avoid confusion

polterguy · February 13, 2023, 7:50am

I couldn’t agree more. I’ve got 10K commits to GitHub the last 5 years or so, and everything we’re doing in the AI space is 100% Open Sauce ==> GitHub - polterguy/magic: Generate a web app in seconds with Low-Code and AI

So I totally agree!

krisu.virtanen · February 14, 2023, 2:03pm

OpenAI tokenizer

ruby_coder · February 14, 2023, 2:13pm

Respect!

I am very biased toward coders and wish more people here would “speak” through code and working examples!

After all, this is supposed to be a community of developers / coders …
