Removing complex words from generation

Hello Everyone.

How would I remove complex words from my generation, like xargs, which has token IDs [87, 22046]? I could either use logit_bias (I'm a bit confused about how the dictionary should be structured) or an instruction in the prompt (I've been using this, but the word still creeps into the output once in a while, and I really want to never generate it).

Any advice or help is appreciated.



You can use this form to specify a token you don't want, and you can add more entries to the dictionary. But 87 looks like it might be pretty common, so stick to the rarer one. Also give it a temperature above 0, so it has some wiggle room to create replies without that token.

  logit_bias={"22046": -100},
  temperature=0.5,
  messages=[{"role": "user", "content": "Do all the things"}]
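To make the shape of the parameters concrete, here is a minimal sketch of building that request dictionary in Python. The token IDs are the ones discussed in this thread; note that logit_bias keys must be token-ID *strings*, and a value of -100 effectively bans the token from being sampled.

```python
# Build the logit_bias dict from a list of token IDs to ban.
# Only the rarer piece is banned here, per the advice above:
# 87 ("x") is too common to block everywhere.
banned_token_ids = [2164]
logit_bias = {str(tid): -100 for tid in banned_token_ids}

request_params = {
    "logit_bias": logit_bias,   # keys are token-ID strings, values in [-100, 100]
    "temperature": 0.5,         # a little randomness lets the model route around the ban
    "messages": [{"role": "user", "content": "Do all the things"}],
}
```

These keyword arguments are what you would pass to the chat completions call.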

Lemme check this out and see how it does. I am a bit hesitant to raise the temperature because of how deterministic I want the output to be, but some wiggle room can be afforded. Thanks!

Actual 100k tokens used by chat models:

"xargs"
[87, 2164]

" xargs"
[865, 2164]
Good point made by @_j there. You seem to have used the OpenAI tokenizer site to generate your token IDs, but that site is not using the latest tokenizer model. The tiktoken library from OpenAI can give you the correct IDs when using the cl100k_base encoding.

Good catch, @_j. I've made the correction. Any idea, though, on how to pass [87, 2164] to the logit_bias parameter?

If you pass just the 2164 one, that is going to be the "args" bit, I would guess, so without the ability to say "args" it will not say "xargs"…

My test script is just about to finish, but the results look promising. Thanks for the quick and prompt (hehehe) reply, folks!


What would likely happen is that it produces the "x" as expected where it is expected, but the next token ("args") is then prohibited, so it will fall back to the next-highest-weighted token.

After you're done having fun with logit_bias as a parameter, you can move on to more forceful prompting.