How does ChatML do the exact formatting?

Hey @logankilpatrick. I’m trying to use tiktoken to pre-compute the exact number of tokens in my prompt before sending a request to the new Chat endpoint. I’m following the guidelines that you guys provide here for formatting the prompt from the list of messages, but the number of prompt tokens in completion.usage.prompt_tokens is always significantly lower than the count I get by formatting the prompt as described in the link. For instance:

messages = [{'role': 'system',
  'content': 'You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.'},
 {'role': 'user', 'content': 'Hello world!'},
 {'role': 'assistant', 'content': 'Hello there!'},
 {'role': 'system', 'content': 'Now, you are Elon Musk. Speak like him.'},
 {'role': 'user', 'content': 'Hello world!'}]

would be formatted as:

<|im_start|>system
You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.<|im_end|>
<|im_start|>user
Hello world!<|im_end|>
<|im_start|>assistant
Hello there!<|im_end|>
<|im_start|>system
Now, you are Elon Musk. Speak like him.<|im_end|>
<|im_start|>user
Hello world!<|im_end|>
<|im_start|>assistant
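
Concretely, the formatting step looks like this (a rough sketch of what I’m doing; format_chatml is just my own helper, not anything from the API):

def format_chatml(messages):
    # Wrap each message in the ChatML delimiters, then prime the reply
    segments = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    segments.append("<|im_start|>assistant")
    return "\n".join(segments)

prompt = format_chatml(messages)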

According to tiktoken, this prompt has 129 tokens, but my API call says the prompt has 70. If I don’t include the special tokens <|im_start|> and <|im_end|>, I get closer, but still not quite there: 61 tokens. Is there any way we can pre-compute the exact number of tokens in our prompt before sending the actual request?
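
In case it helps, this is exactly how I’m counting (continuing from the prompt string built above):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
len(enc.encode(prompt))  # 129: the delimiter strings get tokenized as plain text

stripped = prompt.replace("<|im_start|>", "").replace("<|im_end|>", "")
len(enc.encode(stripped))  # 61: still not the 70 the API reports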
Thanks a lot!!

How many tokens do you get when you use tiktoken on the text in the messages list at the top?

You mean this guy?:

import tiktoken
encoding = tiktoken.get_encoding("gpt2")

def num_tokens_from_string(string, encoder) -> int:
    return len(encoder.encode(string))

s = 'You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.'
num_tokens_from_string(s, encoding)

Output: 22

The reason for this is that you have to account for the additional tokens the chat format wraps around each message.
This is explained in the Microsoft document titled “Learn how to work with the ChatGPT and GPT-4 models”, which includes Python code for calculating token counts (posting links isn’t allowed).

import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Estimate the number of prompt tokens used by a list of chat messages."""
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # every message follows <|im_start|>{role/name}\n{content}<|im_end|>\n
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":  # if there's a name, the role is omitted
                num_tokens += -1  # role is always required and always 1 token
    num_tokens += 2  # every reply is primed with <|im_start|>assistant
    return num_tokens
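
As a quick sanity check, you can compare the estimate with what the API reports back (a sketch, reusing the messages list from the top of the thread):

import openai

estimate = num_tokens_from_messages(messages)
completion = openai.ChatCompletion.create(model="gpt-3.5-turbo-0301", messages=messages)
print(estimate, completion.usage.prompt_tokens)  # the two counts should match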