I could use an explanation of why the sequence “overpythonisation inoperativeness and overpythonisation” yields three different tokens for “over” (927, 3376, 2017).
The same happens with the tokeniser on OpenAI’s site.
The origin of my search lies in trying to understand how the tokeniser handles whitespace. Why is there one token that includes a leading whitespace, and why are there two that do not?
Interestingly, ChatGPT itself told me the opposite…
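If you want to poke at this outside the web tokeniser, here is a minimal sketch using the open-source tiktoken package, assuming the r50k_base encoding used by the older GPT-3 models (token IDs differ between encodings, so the numbers may not match exactly):

```python
# pip install tiktoken
import tiktoken

# r50k_base is the encoding used by the older GPT-3 models; other
# encodings (p50k_base, cl100k_base) assign different IDs.
enc = tiktoken.get_encoding("r50k_base")

text = "overpythonisation inoperativeness and overpythonisation"

# Print each token ID next to the exact bytes it stands for, so any
# leading space inside a token becomes visible (e.g. b' over' vs b'over').
for tid in enc.encode(text):
    print(tid, enc.decode_single_token_bytes(tid))
```

Running this shows which pieces carry a leading space and which do not.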
Something I would actually be interested to know: how large are GPT’s token vocabularies? My current info is around 50k tokens, but is this correct?
(I am also interested in the size of the embeddings. My current info is that it is around 500 values per token, but I’ve seen mentions of thousands of values.)
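One way to check the vocabulary sizes directly is with tiktoken; a minimal sketch over the published encodings (roughly 50k tokens for the GPT-3 encodings and roughly 100k for cl100k_base, which the chat models use):

```python
import tiktoken

# n_vocab is the total number of tokens in the encoding,
# including special tokens such as <|endoftext|>.
for name in ["r50k_base", "p50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(name, enc.n_vocab)
```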
The embedding size differs from model to model. I don’t think one has been published for GPT-4, but for DaVinci each token has a vector with 12288 dimensions, and that number will probably be higher for GPT-4.
Someone went ahead and made a list of all the tokens for anyone interested.
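If you would rather generate such a list yourself, here is a minimal sketch with tiktoken (assuming r50k_base again; the try/except skips any reserved IDs that do not map to a token):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

# Dump every token ID and its text to a file. Some tokens are raw byte
# sequences that are not valid UTF-8 on their own, hence errors="replace".
with open("r50k_base_tokens.txt", "w", encoding="utf-8") as f:
    for tid in range(enc.n_vocab):
        try:
            token_bytes = enc.decode_single_token_bytes(tid)
        except KeyError:
            continue  # ID not assigned in this encoding
        f.write(f"{tid}\t{token_bytes.decode('utf-8', errors='replace')}\n")
```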
Whitespace is almost always tokenized with the word that follows it. If it weren’t, the number of tokens required for a piece of text would nearly double.
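A quick way to see that is to decode each token on its own and look at where the spaces land; a minimal sketch (again with tiktoken and the r50k_base encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

text = "the quick brown fox"

# Decode each token separately: every piece after the first starts with
# a space, because the whitespace travels with the word that follows it.
pieces = [enc.decode([t]) for t in enc.encode(text)]
print(pieces)  # likely ['the', ' quick', ' brown', ' fox']
```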