I noticed this a while back. Any idea what tokenizer OpenAI's tool is using? It says it's the tokenizer for GPT-3, which should be either p50k_base or r50k_base, but I don't get the same token count when I calculate tokens with tiktoken in Python (in a Google Colab notebook) as I do when I paste the same text string into the OpenAI website.
For a given sample, I get 480 tokens from cl100k_base, 485 from either p50k_base or r50k_base, and around 503 from the website. So the website doesn't seem to match any encoding base that tiktoken supports, even though it shows you, in delightful color-coded chunks, exactly how it's splitting the text.
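For reference, a minimal snippet along these lines reproduces the kind of comparison I'm describing (the sample string is just a placeholder; swap in your own text):

```python
import tiktoken

sample_text = "your sample text here"  # placeholder

# Count tokens under each encoding base to compare against the website
for name in ("cl100k_base", "p50k_base", "r50k_base"):
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(sample_text))
    print(f"{name}: {n_tokens} tokens")
```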
Very odd. It doesn't really matter for me because I do a lot of token calculations before sending API calls. But since generations aren't of predictable length even with token limits, I make sure there's "padding" around my prompts so that token limits aren't exceeded. More recently I've just been using gpt-3.5-turbo-16k with about 5k input tokens, as that tends to yield the best results for my needs.
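In case it helps, here's a rough sketch of the padding idea. The 16,384 context window is the documented size for gpt-3.5-turbo-16k, but the safety margin is an arbitrary illustrative number, not a recommendation:

```python
import tiktoken

CONTEXT_WINDOW = 16_384   # gpt-3.5-turbo-16k context size
SAFETY_PADDING = 200      # arbitrary slack for chat formatting overhead / count drift

# cl100k_base is the encoding used by the gpt-3.5-turbo family
enc = tiktoken.get_encoding("cl100k_base")

def max_completion_tokens(prompt: str) -> int:
    """Tokens left for the completion after the prompt and padding."""
    prompt_tokens = len(enc.encode(prompt))
    return max(0, CONTEXT_WINDOW - prompt_tokens - SAFETY_PADDING)
```

The max_tokens value you pass to the API call then never asks for more room than the context window actually has.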