Does the tokenizer https://platform.openai.com/tokenizer not handle whitespace correctly?

I have been using the tokenizer https://platform.openai.com/tokenizer to get a feel for what is going on.

For the phrase “an overstretched makeover”, the tokenizer shows the following tokenization:
|an| over|stretched| make|over

Isn’t it the case that | over is in fact two tokens: |<whitespace>|over|? That is what GPT tells me when I ask. How does it really work?

Use

https://tiktokenizer.vercel.app/

instead. The tokenizer on the OpenAI site hasn’t been updated to use cl100k_base yet.
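If you want to check locally instead of in a web UI, the tiktoken Python package shows the same thing. Here is a rough sketch, assuming the package is installed and using cl100k_base (the encoding used by gpt-3.5-turbo and gpt-4):

# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4
enc = tiktoken.get_encoding("cl100k_base")

text = "an overstretched makeover"
token_ids = enc.encode(text)

# Decode each id on its own to see where the token boundaries fall;
# a leading space is part of the token itself (e.g. " over").
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)
print(pieces)

You should see that any leading whitespace shows up inside the following token rather than as a token of its own.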


I could use an explanation of why the sequence “overpythonisation inoperativeness and overpythonisation” gives three different tokens for “over” (927, 3376, 2017).

The same happens with the tokeniser on OpenAI’s site.

The origin of my question lies in trying to understand how the tokeniser handles whitespace. Why does one of the tokens include a leading whitespace while the other two do not?

The whitespace character is tokenized with what follows it.

So,

over
 over

will have two different tokenizations.
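To see it concretely, here is a small sketch using the tiktoken package and the cl100k_base encoding (an assumption on my part; the older site tokenizer uses a different vocabulary, so the exact ids will differ):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# "over" at the start of a string carries no leading space,
# " over" does, and the space is part of the token's bytes,
# so the two encode to different token ids.
print(enc.encode("over"))
print(enc.encode(" over"))

That is why the same word can get different ids depending on whether it sits at the start of the text or after a space.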

Please note, though, that oper and over are different strings, which is why they are represented by different tokens.


(Stupid error by me on oper/over)

Interestingly, ChatGPT itself told me the opposite…

Something I would actually like to know: how large are GPT’s token vocabularies? My current understanding is around 50k tokens, but is this correct?

(I am also interested in the size of the embeddings. My current info is that it is around 500 values per token, but I’ve seen mentions of thousands of values.)

The tokenizer currently in use is named cl100k_base, so roughly 100,000 tokens.

Thank you. And is the embedding size about 500 values per token?

The embedding size differs from model to model. I don’t think there’s a published figure for GPT-4, but for DaVinci each token has a vector with 12,288 dimensions; that number will probably be higher for GPT-4.

Technically 100,256. :wink:
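You can check it yourself with tiktoken (a sketch, assuming the package is installed). Note that n_vocab is defined as max_token_value + 1, so it comes out a little above 100,256 because the special tokens such as <|endoftext|> sit at the top of the id range:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# max_token_value is the highest token id; n_vocab is that plus one.
# Special tokens (e.g. <|endoftext|>) occupy ids above the
# 100,256 ordinary BPE tokens, so n_vocab prints slightly more.
print(enc.max_token_value)
print(enc.n_vocab)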

@gerben.wierda

Someone went ahead and made a list of all the tokens for anyone interested.

Whitespace is almost always tokenized with the word that follows it. If it weren’t, the number of tokens required for a piece of text would nearly double.
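A rough way to see that doubling, again assuming tiktoken and cl100k_base: encode a sentence normally, then encode every word and every space as separate pieces and compare the counts.

import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "the quick brown fox jumps over the lazy dog"

# Normal encoding: each " word" (space included) is usually one token.
normal = enc.encode(text)

# Split the spaces out and encode every piece on its own,
# so each space has to be paid for as a separate token.
pieces = re.split(r"(\s)", text)
separate = [tid for piece in pieces if piece for tid in enc.encode(piece)]

print(len(normal), len(separate))  # the second count is roughly double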
