What is the reason for adding total 7 tokens?

backroot · August 29, 2023, 5:37am

A total of 7 tokens are added in the OpenAI API sample code below. Please tell me the reason and calculation logic.

tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n

Why 4 tokens?
“<|start|>{role/name}\n{content}<|end|>\n” is not 4 tokens.

num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>

Why 3 tokens?
“<|start|>assistant<|message|>” is not 3 tokens.

anon22939549 · August 29, 2023, 6:01am

What, exactly, are you trying to ask?

<|im_start|>user<|im_sep|><|im_end|>

Is four tokens: 100264, 882, 100266, 100265.

<|im_start|>assistant<|im_sep|>

Is three tokens: 100264, 78191, 100266.

backroot · August 29, 2023, 6:08am

anon22939549 · August 29, 2023, 6:30am

You’re not using the right tokenizer.

Anything inside of (and including) <||> is a special token, used to inform ChatGPT about the parts of the messages.

backroot · August 29, 2023, 6:40am

@anon22939549
I got it. I would like to know why every message carries 4 tokens and why every reply is primed with 3 tokens.

anon22939549 · August 29, 2023, 6:43am

Because the user message needs to be identified as such. That takes 4 tokens: the start and end, the user role identifier, and the separator between the role and the message.
The reply priming is the same framing without the end token, which gets generated in the response. That’s 3 tokens.
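A minimal sketch of that framing, using the token IDs quoted earlier in this thread. The helper names frame_message and prime_reply are made up for illustration, and content is passed in as already-tokenized IDs:

```python
# Special-token IDs as quoted earlier in this thread.
IM_START, IM_END, IM_SEP = 100264, 100265, 100266
ROLE_IDS = {"user": 882, "assistant": 78191}  # each role name is a single token

def frame_message(role, content_ids):
    """Wrap tokenized content in start, role, separator and end tokens."""
    return [IM_START, ROLE_IDS[role], IM_SEP, *content_ids, IM_END]

def prime_reply():
    """Open an assistant turn; the model generates the content and the end token."""
    return [IM_START, ROLE_IDS["assistant"], IM_SEP]

empty_user = frame_message("user", [])
print(empty_user, len(empty_user))        # 4 framing tokens around empty content
print(prime_reply(), len(prime_reply()))  # 3 priming tokens
```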

backroot · August 29, 2023, 6:55am

@anon22939549
Thank you.
Do you know of a website about special tokens?

_j · August 29, 2023, 7:10am

You will note in the link that the values differ for current models (only the original gpt-3.5-turbo-0301, with its different internal model selector endpoint, is handled differently):

if model in {
    "gpt-3.5-turbo-0613",
    "gpt-3.5-turbo-16k-0613",
    "gpt-4-0314",
    "gpt-4-32k-0314",
    "gpt-4-0613",
    "gpt-4-32k-0613",
}:
    tokens_per_message = 3
    tokens_per_name = 1

They don’t just include an obfuscated comment for this code; there is no comment at all.

Overhead is different on these.

(edit: see later post for calculation with bare tokenizer input)

A fixed tokens_per_name, though, is unreliable; the colon is not always an additional token:

[image]

The final three overhead tokens come from the “assistant” prompt that is injected at the end.
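As a rough sketch of how those constants combine, here is a counting loop in the style of the sample code. The function name estimate_prompt_tokens and the count_tokens parameter are hypothetical, and the word-based counter is only a stand-in so the example runs without a tokenizer:

```python
def estimate_prompt_tokens(messages, count_tokens,
                           tokens_per_message=3, tokens_per_name=1):
    """Estimate prompt tokens for a list of chat messages (dicts of str -> str)."""
    total = 0
    for message in messages:
        total += tokens_per_message       # framing tokens around each message
        for key, value in message.items():
            total += count_tokens(value)  # role, content and name are all encoded
            if key == "name":
                total += tokens_per_name  # extra token when a name is supplied
    return total + 3                      # reply is primed with the assistant header

# Stand-in counter for demonstration only: one "token" per whitespace-separated word.
def word_count(text):
    return len(text.split())

messages = [{"role": "user", "content": "hello"}]
print(estimate_prompt_tokens(messages, word_count))  # 3 + 1 + 1 + 3 = 8
```

With a real tokenizer you would pass something like lambda s: len(enc.encode(s)) as count_tokens, where enc comes from tiktoken.encoding_for_model. The result is still an estimate, for the reason given above about names.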

backroot · August 29, 2023, 7:32am

@_j
So tokens_per_name should be taken into account when using the optional name parameter.
tokens_per_name is not always 1 token.
Do I understand correctly?

_j · August 29, 2023, 7:54am

tokens_per_name = 1 is correct, unless you provide a name input for which it isn’t: where a single token such as “:x” (demonstrated above), or even “:name”, exists, it would be used and the colon adds no extra token.

The overhead of one message = 7 billed tokens, the overhead of two = 11, three = 15.

This is a calculation scheme that is not broken by any inputs that seem reasonable:

[image]
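The 7, 11, 15 progression above is four framing tokens per message (three plus a one-token role) plus three for the reply priming. A quick check of that arithmetic, with a made-up helper name; this is not the scheme from the image, and it assumes no name field:

```python
def chat_overhead(n_messages, per_message=4, reply_priming=3):
    """Framing tokens billed on top of content for n messages without a name."""
    return per_message * n_messages + reply_priming

for n in (1, 2, 3):
    print(n, chat_overhead(n))  # 1 7, then 2 11, then 3 15
```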

emaggiori · October 23, 2023, 1:34pm

Hi. Where did you find out the token IDs for those special tokens? I can only see the IDs of some of the special tokens using tiktoken (like <|endoftext|>) but not the rest. Thanks!

sector373 · December 11, 2023, 9:28pm

What is the difference in how special tokens in newer models are added (why is tokens_per_message 3 instead of 4)?

_j · December 11, 2023, 9:38pm

gpt-3.5-turbo-0301: 9 prompt tokens
gpt-3.5-turbo-0613: 8 prompt tokens
gpt-3.5-turbo-1106: 8 prompt tokens
gpt-3.5-turbo-16k-0613: 8 prompt tokens
gpt-4-0314: 8 prompt tokens
gpt-4-0613: 8 prompt tokens
gpt-4-1106-preview: 8 prompt tokens
gpt-4-vision-preview: 8 prompt tokens

I see no difference. The overhead per message is 3 tokens plus the role token such as “assistant”. gpt-3.5-turbo-0301 used an older ChatML format.
