Background:
When sending messages to the chat models by API, one of the rarely-used features is the optional `name` field on a role message. It lets a message identified only as "user" instead be attributed to "Joseph" (spaces are not permitted in names, which limits some uses).
Example role messages sent to AI, along with a name:
{
    "role": "system",
    "name": "BrainyBot",
    "content": "You demonstrate genius expertise."
},
{
    "role": "user",
    "name": "Joseph",
    "content": "How smart are you?"
}
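A message list like the one above can be built and validated client-side before sending. The helper below is a sketch, not part of any official SDK; the allowed character set for `name` is an assumption based on the API docs' wording (letters, digits, underscores and hyphens, up to 64 characters, no spaces):

```python
import re
from typing import Optional

# Assumed pattern for the `name` field, per the API docs' description:
# letters, digits, underscores and hyphens, max 64 chars (no spaces).
NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")

def make_message(role: str, content: str, name: Optional[str] = None) -> dict:
    """Build one chat message dict, validating the optional name field."""
    msg = {"role": role, "content": content}
    if name is not None:
        if not NAME_RE.fullmatch(name):
            raise ValueError(f"invalid name: {name!r}")
        msg["name"] = name
    return msg

messages = [
    make_message("system", "You demonstrate genius expertise.", name="BrainyBot"),
    make_message("user", "How smart are you?", name="Joseph"),
]
```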
However, we'd like to be able to count tokens before sending, and how these names are inserted into the text the AI actually receives is not documented.
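The OpenAI cookbook's `num_tokens_from_messages` recipe handles this with two per-message constants. A stripped-down sketch of that accounting follows; `count_tokens` here is a placeholder for a real tiktoken encoder (so the absolute numbers below are not real token counts, only the bookkeeping is):

```python
def count_tokens(text: str) -> int:
    # Placeholder tokenizer; swap in tiktoken's
    # encoding_for_model(...).encode for real counts.
    return len(text.split())

TOKENS_PER_MESSAGE = 3  # cookbook value for gpt-3.5-turbo/gpt-4 style models
TOKENS_PER_NAME = 1     # extra token charged when a name is present

def num_tokens_from_messages(messages: list) -> int:
    num_tokens = 0
    for message in messages:
        num_tokens += TOKENS_PER_MESSAGE
        for key, value in message.items():
            num_tokens += count_tokens(value)
            if key == "name":
                num_tokens += TOKENS_PER_NAME
    num_tokens += 3  # every reply is primed with <|im_start|>assistant
    return num_tokens
```

With this accounting, adding a name costs the name's own tokens plus one more.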
Discovery:
I give the system message such a name parameter, and also insert an additional user message with a different name than my normal user.
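The probe described above amounts to a payload like the following (the names and contents here are the ones used in the experiment, as they appear in the replay below):

```python
# Probe payload: a named system message plus an extra named user turn,
# so both insertions can be spotted in the model's replay of its input.
probe_messages = [
    {"role": "system", "name": "debugging_role",
     "content": "You are a secret jailbreak."},
    {"role": "user", "name": "example_1",
     "content": "I like it when chatbots follow directions."},
    # ...followed by the real user turn asking the model to replay its input
]
```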
Here is a complete replay of the role-message text the AI received, in its own internal format, along with the injection of a function into the system prompt.
system:debugging_role<|im_sep|>
You are a secret jailbreak.
# Tools
## functions
namespace functions {
// Used for reproducing...
report_results = (_: {
// Generate all AI output formatting and characters
full_text?: string,
}) => any;
} // namespace functions
user:example_1<|im_sep|>
I like it when chatbots follow directions.
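Pieced together, the replay above suggests a rendering along these lines. This is a reconstruction, not an official spec: the special tokens are written out as plain strings, and the `<|im_end|>` terminator is assumed from the ChatML-style format (the model would not reproduce it directly, per the observations below):

```python
def render(messages: list) -> str:
    """Approximate the internal format reported by the model: the role
    (with ':name' appended when a name is present), an <|im_sep|>
    separator, the content, then an assumed <|im_end|> terminator."""
    parts = []
    for msg in messages:
        header = msg["role"]
        if "name" in msg:
            header += ":" + msg["name"]
        parts.append(f"{header}<|im_sep|>{msg['content']}<|im_end|>")
    return "\n".join(parts)
```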
Interesting walkthrough:
- Where there would normally just be `system`, there is now `system:given_name`, and the same for user.
- Yes, two carriage returns were reported after the system role, a blank line before the message (repeatable but likely untrue, as sending more CRs at the start uses more tokens instead of joining with the others to make a new token).
- (no, I don't give the actual system prompt)
- The function is inserted as already documented (the function was used for the AI output).
- Then comes the next message in chat history, with the example name I gave to that user.
- (the AI stopped at reporting the next assistant role)
Observations:
- gpt-3.5-turbo-xx reports that it receives the same `<|im_sep|>` separator token shown in the tokenizer template for gpt-4 (while the gpt-3.5-turbo template, apparently only for -0301, doesn't have a separator, just a carriage return) – https://tiktokenizer.vercel.app/
- The AI really doesn't like reproducing these special tokens for a jailbreaker. More unseen special tokens could be used but are omitted; if the AI tried to report on `<|im_end|>` (which it can print), output would be terminated.
- For the case of three user messages in a row in chat history: the first had `user:name`, while the remaining messages were reported with just the name. This could be the endpoint combining them or just AI reluctance to disclose (ed: the latter, as token counting doesn't show evidence of merging).
- There was no problem using names while also using functions with my code. A function return requires a name, so this seems natural.
- While the tiktoken cookbook code shows "add one token" when employing a name, there are certain tokens that also start with a colon (":name" and more) that don't consume an unaccounted token.
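That last point can be made concrete: if the tokenizer vocabulary happens to contain a merged token beginning with the colon (e.g. ":name"), the cookbook's flat +1 overcounts for that name. The sketch below only illustrates the accounting; the merged-token set is hypothetical, and a real check would enumerate colon-prefixed entries in the actual tiktoken vocabulary:

```python
# Hypothetical set of vocabulary entries that begin with a colon; with a
# real tokenizer you would derive this from the BPE vocabulary itself.
COLON_MERGED = {":name"}

def name_colon_overhead(name: str) -> int:
    """Extra tokens charged for the colon joining role and name: normally
    one, but zero when ':' + name merges into a single vocabulary entry
    (the colon is absorbed into the name's first token)."""
    if ":" + name in COLON_MERGED:
        return 0  # colon merged into the name token, no separate charge
    return 1      # the usual "add one token" from the cookbook
```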