How would I remove complex words like xargs, which has token IDs [87, 22046], from my generation? Either by using logit_bias (I'm a bit confused about how the dictionary should be structured) or by using an instruction in the prompt (I've been using this, but the word still creeps into the output once in a while, and I'd really, really like to never generate it)?
You can use the form below to specify a token you don't want, and you can add more entries to the dictionary. The 87 looks like it might be a pretty common token, though, so stick to biasing the rarer one. Also give it a temperature above 0 so it has some wiggle room to create replies without that token.
```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    # -100 effectively bans token 22046 from appearing in the output
    logit_bias={"22046": -100},
    # a little temperature gives the model room to reword around the ban
    temperature=0.5,
    messages=[{"role": "user", "content": "Do all the things"}],
)
```
Lemme check this out and see how it does. I'm a bit hesitant to raise the temperature because of how deterministic I want the output to be, but some wiggle room can be afforded. Thanks!
Good point by @_j there: it looks like you used the OpenAI tokenizer site to generate your token IDs, and that site does not use the latest tokenizer model. OpenAI's tiktoken library can give you the correct IDs for the cl100k_base encoding that gpt-3.5-turbo uses.
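For example, a minimal check with tiktoken. Note the leading space: mid-sentence a word usually appears as " xargs", which tokenizes differently from bare "xargs", so check both forms before building your logit_bias dictionary:

```python
import tiktoken

# Get the exact encoding gpt-3.5-turbo uses (cl100k_base)
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# A word tokenizes differently with and without a leading space
print(enc.encode("xargs"))   # word at the start of a message
print(enc.encode(" xargs"))  # word mid-sentence, preceded by a space
```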
What would likely happen is that it produces the "x" as expected where it is expected, but the next token is prohibited, so the model falls back to the next-highest-weighted token.
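As a toy illustration of that mechanism (made-up logits, not OpenAI's actual sampler): logit_bias adds the given value to a token's logit before the softmax, so -100 pushes that token's probability to effectively zero and the next-highest candidate wins.

```python
import math

# Made-up logits for three candidate tokens at one decoding step
logits = {"args": 5.0, "argv": 3.5, "-0": 3.0}

# logit_bias adds the given value to a token's logit before softmax;
# -100 pushes it so far down its probability is effectively zero
logits["args"] += -100

total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}
print(probs)  # "args" ~ 0; "argv" is now the most likely continuation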
Once you're done having fun with logit_bias as a parameter, you can move on to more forceful prompting.
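For instance, a sketch combining the logit_bias above with a system-message instruction. The exact wording of the instruction is just one possibility, not a guaranteed fix:

```python
import openai

# Belt and braces: instruct the model *and* bias the token
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    logit_bias={"22046": -100},
    temperature=0.5,
    messages=[
        {"role": "system",
         "content": "Never use the word 'xargs' in your answers. "
                    "If a shell pipeline would need it, describe an alternative instead."},
        {"role": "user", "content": "Do all the things"},
    ],
)
print(response["choices"][0]["message"]["content"])
```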