The tokens are integers that are then transformed inside the model to floating point vectors.
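To make that first step concrete, here is a minimal sketch of the lookup, assuming a made-up vocabulary of 8 tokens and 4-dimensional vectors. The names `embedding_table` and `embed` are hypothetical, and the values are random. A real model learns them during training.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

VOCAB_SIZE = 8   # hypothetical tiny vocabulary
EMBED_DIM = 4    # hypothetical embedding width

# One row of floats per token ID. Random here; learned in a real model.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the floating point vector for each integer token ID."""
    return [embedding_table[t] for t in token_ids]

token_ids = [3, 1, 5]  # made-up IDs standing in for tokenizer output
vectors = embed(token_ids)
print(len(vectors), "vectors of width", len(vectors[0]))
```

The point is only that the integer is an index into a table, nothing more.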
Tokens are the atomic units of the LLM. So you might ask, why not just make a token out of each letter or character? You can, and you get a small vocabulary and essentially no OOV (out-of-vocabulary) words, but each character carries little to no information.
So training on characters gives you a system that is starved of information per token.
On the other hand, you could have each word be its own token. Here there is maximum information per token, but given all the misspellings, you end up with a massive set of tokens, too much information packed into each one, and lots of OOV words.
So the middle ground is sub-word or small-word tokens. Here you have a relatively small set of tokens, fewer OOV words, and the right balance of information per token.
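Here is a rough sketch of the three options side by side. The vocabularies are hand-picked for the illustration, and the sub-word splitter is a simple greedy longest-match, which is an assumption on my part and not how any particular production tokenizer works.

```python
text = "unhappily tokenized"

# Hypothetical vocabularies, chosen by hand for this illustration.
WORD_VOCAB = {"happy", "token", "unhappy"}
SUBWORD_VOCAB = {"un", "happi", "ly", "token", "ized"}

def char_tokens(s):
    """Every character is a token: tiny vocabulary, long sequences."""
    return list(s.replace(" ", ""))

def word_tokens(s, vocab):
    """Whole words are tokens; anything unseen becomes <OOV>."""
    return [w if w in vocab else "<OOV>" for w in s.split()]

def subword_tokens(s, vocab):
    """Greedy longest-match split; falls back to single characters."""
    out = []
    for word in s.split():
        i = 0
        while i < len(word):
            # Try the longest piece first, shrinking until one matches.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    out.append(word[i:j])
                    i = j
                    break
    return out

chars = char_tokens(text)
words = word_tokens(text, WORD_VOCAB)
subwords = subword_tokens(text, SUBWORD_VOCAB)

print(len(chars), "char tokens:", chars)
print(len(words), "word tokens:", words)
print(len(subwords), "sub-word tokens:", subwords)
```

With these toy vocabularies, the character version gives a long sequence of low-information tokens, the word version loses both words to OOV, and the sub-word version covers the text with a handful of reusable pieces.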
GPT has gone from a 50k token library to a 100k token library. More tokens is generally better, if you can handle it, so as time goes on we may get massive million-token tokenizers, but we aren’t there yet.
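One cost of a bigger vocabulary is that the embedding table grows linearly with it. A back-of-the-envelope sketch, assuming a hypothetical embedding width of 768 (that number is my assumption for illustration, not a claim about any specific model):

```python
EMBED_DIM = 768  # hypothetical embedding width, assumed for illustration

def embedding_params(vocab_size, embed_dim=EMBED_DIM):
    """Number of floats in a vocab_size x embed_dim embedding table."""
    return vocab_size * embed_dim

for vocab_size in (50_000, 100_000, 1_000_000):
    params = embedding_params(vocab_size)
    print(f"{vocab_size:>9,} tokens -> {params:,} embedding parameters")
```

This counts only the input embedding table, so treat it as a lower bound on what a larger vocabulary costs.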
So why not just go straight to a vector? Well, you could map each word to its own vector. I have done this, and what happens is you land in the “large token” scenario, where there is too much information coming in for the network to handle. So you back off and go to sub-word, similar to the tokenizers today.
You are trying to make a decision in the network with limited computing resources. So the tokens are transformed into vectors that live in a continuous space, and in this space, close vectors have close meanings. So you are globbing meaning into localized chunks of the space, instead of giving each different thing a dramatically different internal representation. Without that, you would need many neurons to make sense of it all, so you have to localize and linearize things to get it to work and get down to a computable number of neurons.
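A small sketch of what "close vectors have close meanings" looks like in practice, using cosine similarity. The three vectors are hand-made for illustration only. Real embeddings are learned and have far more dimensions.

```python
import math

# Hand-made 3-dimensional vectors, for illustration only.
toy_embeddings = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.90, 0.15],
    "bolt":   [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

near = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
far = cosine_similarity(toy_embeddings["cat"], toy_embeddings["bolt"])

print(f"cat vs kitten: {near:.3f}")
print(f"cat vs bolt:   {far:.3f}")
```

The related pair scores much higher than the unrelated pair, which is the kind of local structure the network can exploit with a limited number of neurons.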
Anyway, hope my rambling helps