Does the tokenizer https://platform.openai.com/tokenizer not handle whitespace correctly?

I have been using the tokenizer https://platform.openai.com/tokenizer to get a feel for what is going on.

For the phrase “an overstretched makeover”, the tokenizer shows the following tokenization:
|an| over|stretched| make|over

Isn’t it the case that | over is in fact two tokens: |<whitespace>|over|? That is what GPT tells me when I ask. How does it really work?

Use

https://tiktokenizer.vercel.app/

instead. The tokenizer on the OpenAI site hasn’t been updated to use cl100k_base yet.
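If you want to check locally instead of in a web UI, the tiktoken Python package shows the same thing. Here is a rough sketch, assuming the package is installed and using cl100k_base (the encoding used by gpt-3.5-turbo and gpt-4):

# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4
enc = tiktoken.get_encoding("cl100k_base")

text = "an overstretched makeover"
token_ids = enc.encode(text)

# Decode each id on its own to see where the token boundaries fall;
# a leading space is part of the token itself (e.g. " over").
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)
print(pieces)

You should see that any leading whitespace shows up inside the following token rather than as a token of its own.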


I could use an explanation of why the sequence “overpythonisation inoperativeness and overpythonisation” gives three different tokens for “over” (927, 3376, 2017).

The same happens with the tokeniser on OpenAI’s site.

The origin of my question lies in trying to understand how the tokeniser handles whitespace. Why does one of the tokens include a leading whitespace while the other two do not?

The whitespace character is tokenized with what follows it.

So,

over
 over

will have two different tokenizations.
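To see it concretely, here is a small sketch using the tiktoken package and the cl100k_base encoding (an assumption on my part; the older site tokenizer uses a different vocabulary, so the exact ids will differ):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# "over" at the start of a string carries no leading space,
# " over" does, and the space is part of the token's bytes,
# so the two encode to different token ids.
print(enc.encode("over"))
print(enc.encode(" over"))

That is why the same word can get different ids depending on whether it sits at the start of the text or after a space.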

Please note, though, that oper and over are different strings, which is why they are represented by different tokens.


(Stupid error by me on oper/over)

Interestingly, ChatGPT itself told me the opposite…

Something I would actually like to know: how large are GPT’s token vocabularies? My current understanding is around 50k tokens, but is this correct?

(I am also interested in the size of the embeddings. My current info is that it is around 500 values per token, but I’ve seen mentions of thousands of values.)

The tokenizer currently in use is named cl100k_base, so roughly 100,000 tokens.

Thank you. And is the embedding size about 500 values per token?

The embedding size differs from model to model. I don’t think there’s a published figure for GPT-4, but for DaVinci each token has a vector with 12,288 dimensions; that number will probably be higher for GPT-4.

Technically 100,256. :wink:
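You can check it yourself with tiktoken (a sketch, assuming the package is installed). Note that n_vocab is defined as max_token_value + 1, so it comes out a little above 100,256 because the special tokens such as <|endoftext|> sit at the top of the id range:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# max_token_value is the highest token id; n_vocab is that plus one.
# Special tokens (e.g. <|endoftext|>) occupy ids above the
# 100,256 ordinary BPE tokens, so n_vocab prints slightly more.
print(enc.max_token_value)
print(enc.n_vocab)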

@gerben.wierda

Someone went ahead and made a list of all the tokens for anyone interested.

Whitespace is almost always tokenized with the word that follows it. If it weren’t, the number of tokens required for a piece of text would nearly double.
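A rough way to see that doubling, again assuming tiktoken and cl100k_base: encode a sentence normally, then encode every word and every space as separate pieces and compare the counts.

import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "the quick brown fox jumps over the lazy dog"

# Normal encoding: each " word" (space included) is usually one token.
normal = enc.encode(text)

# Split the spaces out and encode every piece on its own,
# so each space has to be paid for as a separate token.
pieces = re.split(r"(\s)", text)
separate = [tid for piece in pieces if piece for tid in enc.encode(piece)]

print(len(normal), len(separate))  # the second count is roughly double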
