To use the logit_bias parameter to encourage or discourage tokens (via the API call), one must find the correct token number, using the correct tokenizer for the model.
This doesn’t have many useful applications, because there are many alternate tokens that appear to be the same word: tokens with or without a leading space, capitalized variants, and even tokens for words preceded by a hyphen or underscore. Demoting tokens that represent parts of words (like the “psych” in psych-ology) can broadly affect other generation.
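As a minimal sketch of what that looks like in practice, assuming the Chat Completions endpoint, the tiktoken library, and an illustrative model name and word (none of these specifics come from the post above):

```python
# Sketch: discouraging one word via logit_bias.
# Assumes the openai and tiktoken packages are installed; the model name,
# the word "zebra", and the list of variants are illustrative only.
import tiktoken
from openai import OpenAI

model = "gpt-4o-mini"  # hypothetical choice; use whatever model you actually call
enc = tiktoken.encoding_for_model(model)

# The same word maps to several distinct tokens (leading space, capitalization),
# so each variant you care about needs its own bias entry.
variants = ["zebra", " zebra", "Zebra", " Zebra"]
bias = {}
for text in variants:
    for token_id in enc.encode(text):
        bias[str(token_id)] = -100  # -100 effectively bans a token, +100 effectively forces it

client = OpenAI()
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Name a striped African animal."}],
    logit_bias=bias,
)
print(response.choices[0].message.content)
```

Note that if a variant encodes to more than one token, the loop biases each piece, which is exactly the “parts of words” hazard mentioned above.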
One last question, if you don’t mind: when I run this sentence through the tokenizer and get output like the attached … those numbers are indices into the word dictionary, correct? In theory I’ve always seen them as representing something like sequential order … as if, in a 50,000-word dictionary, “a” would have a lower number (index into the dictionary) than “zebra”.
But either I’m making a mistake in interpreting the screenshot, or I’ve just assumed that because all the examples in the introductory material were built that way. Which is it?
While some tokens are just letters or ASCII characters that appear early because they were placed in the token dictionary by hand, many others are the result of discovering ways to compress the data of the training corpus, so their numbering reflects that process rather than alphabetical order.
However, the tokenizer link I gave will also let you check the tokenizer of recommended models (and not just particular old ones like the OpenAI one you show), and see the individual tokens marked by color.
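You can also check this programmatically. Here is a small sketch using the tiktoken library (the encoding name and example words are mine, not from your screenshot); it shows that the IDs follow merge/placement order, not the alphabetical dictionary order you were expecting:

```python
# Sketch: inspect token IDs with tiktoken (assumed installed).
# IDs are artifacts of how the vocabulary was built, not alphabetical indices.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding name chosen as an example

for word in ["a", "zebra", "psychology", " psychology"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:15} -> ids={ids} pieces={pieces}")
```

Running something like this makes it easy to see which words are a single token, which split into pieces, and how little the numeric value of an ID tells you about the word itself.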