I noticed this a while back. Any idea what tokenizer OpenAI's tool is using? It says it's the tokenizer for GPT-3, which should be either p50k_base or r50k_base, but I don't get the same token count when I calculate tokens with tiktoken in Python (in a Google Colab notebook) as I do when I paste the same text string into the OpenAI website.
For a given sample, I get 480 tokens from cl100k_base, 485 from either p50k_base or r50k_base, and around 503 from the website. So the website doesn't seem to match any encoding base that tiktoken supports, even though it shows you, in delightful color-coded chunks, exactly how it's splitting the text.
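For reference, a minimal snippet along these lines reproduces the kind of comparison I'm describing (the sample string is just a placeholder; swap in your own text):

```python
import tiktoken

sample_text = "your sample text here"  # placeholder

# Count tokens under each encoding base to compare against the website
for name in ("cl100k_base", "p50k_base", "r50k_base"):
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(sample_text))
    print(f"{name}: {n_tokens} tokens")
```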
Very odd. It doesn't really matter for me because I do a lot of token calculations before sending API calls. But since generations aren't of predictable length even with token limits, I make sure there's "padding" around my prompts so that token limits aren't exceeded. More recently I've just been using gpt-3.5-turbo-16k with about 5k input tokens, as that tends to yield the best results for my needs.
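In case it helps, here's a rough sketch of the padding idea. The 16,384 context window is the documented size for gpt-3.5-turbo-16k, but the safety margin is an arbitrary illustrative number, not a recommendation:

```python
import tiktoken

CONTEXT_WINDOW = 16_384   # gpt-3.5-turbo-16k context size
SAFETY_PADDING = 200      # arbitrary slack for chat formatting overhead / count drift

# cl100k_base is the encoding used by the gpt-3.5-turbo family
enc = tiktoken.get_encoding("cl100k_base")

def max_completion_tokens(prompt: str) -> int:
    """Tokens left for the completion after the prompt and padding."""
    prompt_tokens = len(enc.encode(prompt))
    return max(0, CONTEXT_WINDOW - prompt_tokens - SAFETY_PADDING)
```

The max_tokens value you pass to the API call then never asks for more room than the context window actually has.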