How to quickly estimate the token count of a text?


Has anyone found a method to quickly estimate the token count of a text? For scripts or small applications it's a bit unpleasant to import a library, or to chew through that 1.9 MB cl100k_base.tiktoken file yourself, just to decide how much text you're going to forward to ChatGPT or whether to use that new 32k model. There is the rule of thumb that one token is about 3/4 of a word, but that only holds for English plain text. For other languages you're lost, and for source code it's even worse.

As a quick guesstimate I collected a couple of types of text and source code from my computer and wrote something that determines, for each character, the average of 1/(token length) over the tokens that contain that character. I then made up some groups of characters for which that value is close:

spaces that are following a space: 0.081
NORabcdefghilnopqrstuvy and single space : 0.202
CHLMPQSTUVfkmspwx : 0.237
-.ABDEFGIKWY_\r\tz{ü : 0.304
!$&(/;=JX`j\n}ö : 0.416
"#%)*+56789<>?@Z[\]^|§«äç’ : 0.479
,01234:~Üß and characters > 255 : 0.658
other characters: 0.98
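
Summing these values over a text could look roughly like the sketch below. The weights are copied from the list above; everything else (the names `GROUPS`, `WEIGHTS` and `estimate_tokens`, and the lookup order) is an assumption made for illustration, not the code from my script. Two details are judgment calls: `s` and `p` appear in two groups, so the sketch lets the first group win, and `’` is above 255 but listed explicitly, so explicit groups are checked before the "> 255" rule.

```python
# Weights copied from the list above. The lookup order is an assumption.
GROUPS = [
    ("NORabcdefghilnopqrstuvy", 0.202),
    ("CHLMPQSTUVfkmspwx", 0.237),
    ("-.ABDEFGIKWY_\r\tz{ü", 0.304),
    ("!$&(/;=JX`j\n}ö", 0.416),
    ("\"#%)*+56789<>?@Z[\\]^|§«äç’", 0.479),
    (",01234:~Üß", 0.658),
]

SPACE_AFTER_SPACE = 0.081   # a space that follows another space
SINGLE_SPACE = 0.202        # a space that follows anything else
ABOVE_255 = 0.658           # unlisted characters with code point > 255
OTHER = 0.98                # every remaining character

# Build a per-character lookup table. setdefault keeps the first weight
# for characters that occur in more than one group (here: 's' and 'p').
WEIGHTS = {}
for chars, weight in GROUPS:
    for ch in chars:
        WEIGHTS.setdefault(ch, weight)


def estimate_tokens(text: str) -> float:
    """Return a rough token count estimate by summing per-character weights."""
    total = 0.0
    prev = ""
    for ch in text:
        if ch == " ":
            total += SPACE_AFTER_SPACE if prev == " " else SINGLE_SPACE
        elif ch in WEIGHTS:
            total += WEIGHTS[ch]
        elif ord(ch) > 255:
            total += ABOVE_255
        else:
            total += OTHER
        prev = ch
    return total


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n"
    print(round(estimate_tokens(sample)))
```

Since the whole thing is one table lookup per character, it needs no tokenizer file and runs in a single pass over the text.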

If you just add up those numbers for all the characters of a file, the result comes reasonably close:

file type | real token count vs. guesstimate
css 123491 vs. 103405
html 232691 vs. 243483
java 671616 vs. 757334
js 838884 vs. 825870
md 60583 vs. 59638
xml 912672 vs. 857563
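
To put a number on "reasonably close", the relative error per file type can be computed from the table above. The counts are copied from the table; the names `RESULTS` and `relative_error` are just made up for this sketch. By this measure the guesstimate ranges from roughly 16% too low (css) to roughly 13% too high (java).

```python
# (real token count, guesstimate) per file type, copied from the table above.
RESULTS = {
    "css": (123491, 103405),
    "html": (232691, 243483),
    "java": (671616, 757334),
    "js": (838884, 825870),
    "md": (60583, 59638),
    "xml": (912672, 857563),
}


def relative_error(real: int, estimate: int) -> float:
    """Signed relative error: negative means the guesstimate is too low."""
    return (estimate - real) / real


if __name__ == "__main__":
    for file_type, (real, estimate) in RESULTS.items():
        print(f"{file_type:5s} {relative_error(real, estimate):+.1%}")
```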

Of course, that is pretty specific to my own texts and codebase. Do you have better ideas?

In case you care: that estimation function and the statistics-creating function are in my token counting script. (The implementation of real token counting in there is embarrassingly inefficient, but it was quick enough for my current purposes.)
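
For readers who don't want to dig through the script, here is a minimal sketch of one way to read the per-character statistic described above. It is not the code from the script. It assumes you already have the text split into token strings by a real tokenizer, and that token length is counted in characters; the name `char_weights` is made up. Every character occurrence in a token of length n gets a share of 1/n, so each token's shares add up to 1, which is why summing the averaged values over a text approximates its token count.

```python
from collections import defaultdict


def char_weights(tokens):
    """Average of 1/len(token) per character, over all its occurrences.

    `tokens` is an iterable of token strings produced by a real tokenizer
    (assumed to be available; obtaining them is outside this sketch).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for token in tokens:
        if not token:
            continue  # skip empty strings to avoid dividing by zero
        share = 1.0 / len(token)
        for ch in token:
            sums[ch] += share
            counts[ch] += 1
    return {ch: sums[ch] / counts[ch] for ch in sums}


if __name__ == "__main__":
    # Toy input: two tokens, "ab" and "a".
    weights = char_weights(["ab", "a"])
    print(weights)
    # Summing the weights over the original text "aba" gives back 2 tokens.
    print(sum(weights[ch] for ch in "aba"))
```

This sketch treats every character independently, so the special case for a space that follows another space would need extra bookkeeping (for example, tracking the previous character within the text). Grouping characters with similar values, as in the list above, is a separate manual step.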

PS: Does anybody know how ChatGPT deals with Unicode characters that are not in the cl100k_base.tiktoken file? I found some even in a ChatGPT-generated text.

Best regards,