How to quickly estimate the token count of a text?

Hi!

Did any of you find a method to quickly estimate the token count of a text? For scripts or small applications it’s a bit unpleasant to import a library, or to parse that 1.9 MB cl100k_base.tiktoken file yourself, just to decide how much material to forward to ChatGPT or whether to use the new 32k model. There is the rule of thumb that one token is about 3/4 of a word, but that only holds for plain English text; for other languages you’re lost, and for source code it’s even worse.

As a quick guesstimate, I collected several types of text and source code from my computer and wrote a small program that determines, for each character, the average of 1/(token length) over all tokens that contain that character. I then made up groups of characters where that value is close (a code sketch follows the list):

a space that follows another space: 0.081
NORabcdefghilnopqrstuvy and a single space: 0.202
CHLMPQSTUVfkmspwx: 0.237
-.ABDEFGIKWY_\r\tz{ü: 0.304
!$&(/;=JX`j\n}ö: 0.416
"#%)*+56789<>?@Z[\]^|§«äç’: 0.479
,01234:~Üß and characters > 255: 0.658
other characters: 0.98

If you just add up those numbers for all the characters of a file, the result comes reasonably close:

| file type | real token count | guesstimate |
| --- | --- | --- |
| css | 123491 | 103405 |
| html | 232691 | 243483 |
| java | 671616 | 757334 |
| js | 838884 | 825870 |
| md | 60583 | 59638 |
| xml | 912672 | 857563 |
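
For reference, a comparison like the one in the table can be reproduced with the tiktoken library roughly like this (a sketch using the guess_token_count function from above, not my original script; the file name is a placeholder):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("example.js", encoding="utf-8") as f:  # placeholder file name
    text = f.read()

# disallowed_special=() keeps encode() from raising if the file happens
# to contain a special-token string such as <|endoftext|>
real = len(enc.encode(text, disallowed_special=()))
print(real, "vs.", round(guess_token_count(text)))
```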

Of course, those weights are pretty specific to my own texts and codebase. Do you have better ideas?

In case you care: the estimation function and the statistics-creating function are in my token counting script. (The implementation of real token counting in there is embarrassingly inefficient, but it was quick enough for my current purposes.)
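
Roughly, the statistics boil down to something like the following (sketched here with tiktoken; whether to count every occurrence of a character in a token or each distinct character once is a judgment call, and my actual script may differ in details):

```python
import tiktoken
from collections import defaultdict

enc = tiktoken.get_encoding("cl100k_base")

def char_weights(sample_texts):
    """For each character, average 1/len(token) over all tokens in the
    sample tokenizations that contain it, so that the contributions of a
    token's characters sum to one per token."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text in sample_texts:
        for token_id in enc.encode(text, disallowed_special=()):
            # decode() substitutes replacement characters for token bytes
            # that are not valid UTF-8 on their own
            token = enc.decode([token_id])
            for ch in token:
                sums[ch] += 1.0 / len(token)
                counts[ch] += 1
    return {ch: sums[ch] / counts[ch] for ch in sums}
```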

PS: Does anybody know how ChatGPT deals with Unicode characters that are not in the cl100k_base.tiktoken file? I found some even in a ChatGPT-generated text.
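
A quick way to poke at this with tiktoken (my guess is that the tokenizer falls back to the raw UTF-8 bytes, which seem to be in the vocabulary as single-byte tokens):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("🦙")  # presumably not a single token in cl100k_base
print(ids)
# Inspect the raw bytes behind each token id; the character appears to be
# split into pieces of its UTF-8 byte sequence.
print([enc.decode_single_token_bytes(i) for i in ids])
```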

Best regards,
Hans-Peter


This was enlightening. Thanks very much for sharing this information and tool!