Why Does ChatGPT Use r50k_base While the Chat API Uses cl100k_base?

I previously used the Chat API to handle some messages, and by chance I tried sending the same content in ChatGPT, only to get the error “The message you submitted was too long”.

After some painstaking investigation, I discovered that ChatGPT uses the r50k_base tokenizer for length validation, while the API uses cl100k_base.

Isn’t this a problem? When using ChatGPT in languages other than English, the usable input length often drops by more than half.

Pinning down the tokenizer from the input side would take a lot of precise measurement and deliberate error triggering. However, it is easy to see that ChatGPT uses a gpt-3.5-turbo model and the 100k-dictionary encoder by:

  • producing text that runs ChatGPT up to its max_tokens value of 1536 (the “continue” button appears along with the truncated text; a command to repeat text, produce a long list, or correct a document will get you there);
  • pasting that text into an online tokenizer after first clearing the input box down to 0 tokens.

You can also see with the tokenizer that some languages use more tokens per character. However, they may use fewer tokens per idea, and fewer for a translated document as well.

The input box of ChatGPT has its own independent character counter/estimator.

I didn’t quite understand. I only tested the GPT-4 model in Plus and found that it does use the r50k_base algorithm, which significantly reduces how much content speakers of non-English (non-Latin-script?) languages can input.

Using GPT-4 in Plus to test prompts is clearly much cheaper than using the API. However, the difference in allowed content length between the two is causing us a lot of trouble.

Again you seem to have come to wrong conclusions.

Here’s a long ChatGPT generation task, truncated to max_tokens (and the “continue generating” button appears):

Press the copy button in ChatGPT to get the raw text, and paste the clipboard contents into the tokenizer:

You get one answer for the number of tokens in the response, the same number every time, but only when you choose the correct token encoder, which is cl100k_base.

The max_tokens of ChatGPT is 2^10 × 1.5 = 1536.

I’m talking about the number of tokens in a single input. According to my tests, the GPT-4 (default) model’s limit is a single input of 4000 tokens (r50k_base) or 16000 characters; the GPT-4-plugins model’s limit is a single input of 8000 tokens (r50k_base) or 32000 characters.
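Taking those measurements at face value, both pairs imply the same fixed ratio, which hints at a character-count heuristic rather than real tokenization (the numbers below are just the figures reported above, not official limits):

```python
# Implied characters-per-token ratio from the reported input limits.
limits = {
    "GPT-4 (default)": (4000, 16000),   # (token limit, character limit)
    "GPT-4 (plugins)": (8000, 32000),
}

for model, (tokens, chars) in limits.items():
    print(f"{model}: {chars / tokens:.1f} characters per token implied")
# Both ratios come out to 4.0, a common rough average for English text.
```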

Do you understand what I mean? I’m talking about the content-length detection of the ChatGPT input box, that is, the condition under which “The message you submitted was too long” appears, not the encoder actually used by the model.

Yes, you can expect that they aren’t going to run a full BPE token encoder with a 2 MB dictionary in client-side code just to validate how much you can put into the input box, right up to the model’s limit. The input length is estimated, which is the very reason I’m able to jam it up with 8000 tokens of Chinese pseudotext:


You also can’t devote the model’s full context to input alone: some of it is reserved for forming the output, the chat history, system messages, custom instructions, and enabled plugins or prompting for other modes, and the exact allowances for those are proprietary knowledge. So you get the input space they want you to have, and a denial, from either the user-interface estimate or the endpoint response, if you go over. Neither is a measure of the model’s token encoder.
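A client-side estimate like the one described above might look something like this sketch. To be clear, this is purely hypothetical illustration, not OpenAI’s actual code; the 4-characters-per-token ratio and the 8000-token ceiling are assumptions borrowed from the measurements discussed earlier:

```python
# Hypothetical sketch of a cheap client-side length check -- NOT OpenAI's
# actual code. A fixed characters-per-token ratio stands in for real BPE.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate assuming a fixed average characters-per-token."""
    return max(1, round(len(text) / chars_per_token))

def within_input_limit(text: str, max_tokens: int = 8000) -> bool:
    """Accept input whose *estimated* token count fits under the limit."""
    return estimate_tokens(text) <= max_tokens

# The heuristic's error depends on the script: CJK text runs closer to one
# token per character under cl100k_base, so a 4-chars-per-token estimate can
# badly undercount it, letting far more real tokens through than intended.
```

That undercounting for dense scripts is consistent with being able to stuff roughly 8000 real tokens of Chinese pseudotext past an estimator tuned for English.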