Official token count differs from OpenAI tokenizer

I’ve been trying to measure the number of tokens in my request. My measurement agrees with https://beta.openai.com/tokenizer?view=bpe, but differs from what https://beta.openai.com/account/usage shows.

Here are the prompt and the response I got, along with their respective token counts.
[Prompt]

Joker is a clown that tells a funny joke when asked to. If a topic is specified in the prompt, the joke needs to be relevant.

Q: Tell a joke.
Joker: What’s the best thing about Switzerland? I don’t know, but the flag is a big plus.

Q: Tell a joke.
Joker: Three men enter a bar in the USSR. One says, ‘Why did Stalin only write in lowercase?’ The other one says, ‘Because he was afraid of capitalism.’

Q: Tell a joke about games.
Joker: Yesterday I saw a guy spill all his Scrabble letters on the road. I asked him, ‘What’s the word on the street?’

Q: Tell a joke.
Joker:

[Response]

What do you call a woman who is working on her computer? A gamer.

Note: There is a single space character before the response.

         | OpenAI Tokenizer | OpenAI Usage | My Measurement
Prompt   | 167              | 167          | 167
Response | 16               | 24           | 16

Does anybody know how I can get the official number of tokens? For now I am using GPT2TokenizerFast.
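For reference, this is roughly how I’m counting (a minimal sketch, assuming the Hugging Face transformers package; the strings here are just illustrative):

```python
from transformers import GPT2TokenizerFast

# GPT-2 BPE vocabulary, which is what the online tokenizer page appears to use
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "Q: Tell a joke.\nJoker:"  # illustrative snippet, not the full prompt
response = " What do you call a woman who is working on her computer? A gamer."

# Token count = length of the encoded id list
print(len(tokenizer.encode(prompt)), len(tokenizer.encode(response)))
```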

What usage is showing up that differs from the numbers you’ve calculated? The usage is the token count of the prompt + completion.
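(For example, a 167-token prompt with a 16-token completion would be billed as 167 + 16 = 183 tokens in total.)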

On the Account Usage page, for that request I see this:

[screenshot of the Account Usage entry for that request]

which is different from the numbers I calculated (167 + 16).

You said your second response was 24 tokens. There’s a small delay on the billing page; could it be that?

I am 100% sure that the billing entry corresponds to that query, since it was the last one of the hour, as you can see:

[screenshot of the hourly usage breakdown]
Sorry, I messed up in my previous response by sending you a different billing entry (even though they have the same values).

Hi there, whitespace and newlines also add tokens (which you may have stripped?). I also see that in your screenshot there are values after the written completion ("| 183…").
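A quick way to check this (a sketch, assuming the Hugging Face GPT-2 tokenizer) is to compare the counts for a few whitespace variants of the same completion:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "What do you call a woman who is working on her computer? A gamer."

# Compare counts for a few whitespace variants of the same completion
for variant in (text, " " + text, text + "\n", text.strip() + "\n\n"):
    print(repr(variant[:15]), len(tokenizer.encode(variant)))
```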


“| 183” is not part of the response; it was added for debugging purposes.

I repeated the measurements with 4 different responses and the same prompt as in the original post. The counts in parentheses are not part of the responses; they show my calculated prompt + response tokens.

I don’t know, but I think it’s a good thing. (167 + 14)

This joke is so dirty that even my mom would blush. (167 + 12)

What did one monkey say to another after a close call? (167 + 12)

I went to the supermarket to buy my wife a present. I saw her favourite brand of beer, so I bought it for her. But she wouldn’t accept it, saying that we don’t drink beer anymore, and that there’s no point in buying something which will soon be out of date. I asked her why we stopped drinking beer, and she said that after a week of not drinking any alcohol at all she was going crazy with withdrawal symptoms. (167 + 91)

Note: There is a single space character before every response.

This is my request breakdown for those 4 queries:

[screenshot of the billing entries for the 4 queries]
Unfortunately the last two have been merged into a single billing entry, but the completion values still differ a bit from my calculations.

Hi there, would it be possible to share a Playground link so I could try to replicate this behaviour? Right now I’m just seeing pasted completions, and it’d be great to see the behaviour first-hand.


Here is the link: Playground

Hi there, I was not able to replicate the behaviour you’re discussing.

[screenshots of the replicated prompt and completion token counts]
After pasting the response did you add a newline? Because I get 24 tokens without it.

What browser are you using? Sometimes copy-paste behavior is inconsistent when dealing with trimmable whitespace characters.


Google Chrome Version 90.0.4430.212 (Official Build) (64-bit) on Linux

First of all, why do we even need to calculate tokens with something like the GPT-2 tokenizer at all? If our dashboard knows how many tokens we used, why doesn’t the GPT-3 API just send the same count back in the response body? Please @joey let me know if I am wrong here, because this is really important when we move to production.

And @edoardo.luciani, did you find a workaround?


Following up on this - barring progress on the token count being included in the API, could you confirm that HF's GPT2Tokenizer.from_pretrained("gpt2") is the one we should be using (versus something from GPT2TokenizerFast)?
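For what it’s worth, here is a quick check (a sketch, assuming the Hugging Face transformers package; the string is just an example) to see whether the two tokenizers disagree on a given input:

```python
from transformers import GPT2Tokenizer, GPT2TokenizerFast

# Load both the "slow" Python tokenizer and the Rust-backed "fast" one
slow = GPT2Tokenizer.from_pretrained("gpt2")
fast = GPT2TokenizerFast.from_pretrained("gpt2")

text = " What do you call a woman who is working on her computer? A gamer."

# If this prints True, the choice shouldn't matter for counting tokens
print(slow.encode(text) == fast.encode(text))
```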