I am trying to ban the word ‘assume’, in my generations. I am trying to use logit bias parameter to accomplish this but the word ‘assume’ as per the OpenAI tokenizer site takes up two tokens. So do I ban both of these tokens?
Hi there! Due to the way tokens work, it’s probably best to suppress “ assume”, including the preceding space (token ID 7048 in that tokenizer). This won’t reliably remove all mentions of assume, since it ignores “assume” without a preceding space and “Assume” (both with and without a space), but I suspect it will help a fair bit.
The site you linked shows the 50k tokenizer for the GPT-3 models, but the models you’d be using through the Chat Completions endpoint are gpt-3.5-turbo or GPT-4. Those use a 100k-token dictionary (cl100k_base) with different token IDs:
" assume": 9855
“assume: 46151
" Assume”: 63297
“Assume”: 5733 + 3972 (and other variants are also compound tokens)
“presume”: 24544 + 3972 (showing that you will stifle language if you block a fragment)
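To put those IDs to use, here’s a minimal sketch of building the `logit_bias` dictionary. It only suppresses the whole-word tokens listed above (not the shared fragment 3972, for the reason just shown). The IDs are the ones quoted in this thread; verify them for your model with the tiktoken library before relying on them.

```python
# Token IDs for "assume" variants in cl100k_base, as quoted in this
# thread -- re-check with tiktoken for your exact model (assumption).
ASSUME_TOKEN_IDS = [9855, 46151, 63297]  # " assume", "assume", " Assume"

def build_logit_bias(token_ids, bias=-100):
    # A bias of -100 effectively bans a token in the Chat Completions
    # API; keys must be token IDs as strings.
    return {str(t): bias for t in token_ids}

logit_bias = build_logit_bias(ASSUME_TOKEN_IDS)
# Pass it in the request, e.g.:
#   client.chat.completions.create(model="gpt-3.5-turbo",
#                                  messages=..., logit_bias=logit_bias)
```

Note that compound variants like "Assume" (5733 + 3972) can still slip through, since banning 3972 alone would also break words like "presume".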