I was re-reading the docs about frequency and presence penalties. As far as I can tell, there is no mention of specific tokens being excluded from the penalties.
As a long-time user of the OpenAI API, I’ve always wondered if it makes sense to exclude what are sometimes called function words or grammatical words, such as determiners and prepositions. From some quick checks using the tokenizer, these words usually get their own tokens; they are frequent and short enough that most tokenization algorithms treat them as single tokens.
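For example, a quick check with tiktoken bears this out (a minimal sketch, assuming the cl100k_base encoding; other encodings may split differently):

```python
# A minimal sketch, assuming the tiktoken library and the cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Function words with a leading space, as they typically appear mid-sentence.
for word in [" the", " a", " of", " to", " in"]:
    ids = enc.encode(word)
    print(f"{word!r} -> {ids} ({len(ids)} token(s))")
# Each of these encodes to a single token id in this encoding.
```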
Penalizing the frequency of these function-word tokens may not make as much sense as penalizing content words such as nouns and adjectives. I was wondering if anyone else has thought about this as well.
The AI can understand a great many things we wouldn’t expect, and also fail to understand a great many things we’d expect it to. However, comprehension is highest when you use the most common form of language, and even the most common form of tokens (such as lowercase with a leading space).
Can we remove prepositions? Take the command: “then after you do that, do another thing for me.” Strip them out and you’ve now given the AI the command “then you do that, do another thing me.” One might also consider removing articles like “the” and “a”, but they carry meaning too.
The penalty API parameters are mostly for discouraging behavior where the AI repeats words in its output or gets stuck in a loop producing the same answer.
A specific token can be encouraged or discouraged statistically by the logit_bias API parameter.
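As a sketch of that (assuming the Python openai client and tiktoken; the model name here is just a placeholder, and logit_bias maps token ids, as string keys, to values in roughly [-100, 100], where the extremes effectively ban or force a token):

```python
# A minimal sketch, assuming the openai Python client and a recent tiktoken.
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.encoding_for_model("gpt-4o-mini")  # match the encoding to the model

# Mildly discourage the token " the" (with leading space) without banning it.
the_id = enc.encode(" the")[0]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Describe the night sky in two sentences."}],
    logit_bias={str(the_id): -5},
)
print(response.choices[0].message.content)
```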
Instruction-following skill demonstration:
system: AI:{"always": "answer incl sarcasm"}
user: star count?
assistant: Oh, let me just count the billions and billions of stars in the universe real quick. Give me a moment.
Right. To clarify, I didn’t expect the AI to detect this, nor was I suggesting that prepositions be removed from language.
I was wondering if it made sense for there to be a closed list of tokens that the alpha_frequency and alpha_presence computation is applied to.
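A minimal sketch of what I mean, assuming the penalty formula from the API docs (mu[j] -> mu[j] - c[j] * alpha_frequency - float(c[j] > 0) * alpha_presence) and a hypothetical whitelist of content-word token ids; the token ids below are made up for illustration:

```python
# A minimal sketch: the documented frequency/presence penalty, but applied only
# to a closed list of "penalizable" token ids (e.g. content words).
from collections import Counter

def penalize_logits(logits, generated_ids, penalizable_ids,
                    alpha_frequency=0.5, alpha_presence=0.5):
    """Return a copy of `logits` (dict: token_id -> logit) with penalties
    applied only to tokens in `penalizable_ids`."""
    counts = Counter(generated_ids)          # c[j]: how often each token was sampled
    adjusted = dict(logits)
    for token_id, c in counts.items():
        if token_id not in penalizable_ids:  # function words etc. are left untouched
            continue
        adjusted[token_id] -= c * alpha_frequency + (1.0 if c > 0 else 0.0) * alpha_presence
    return adjusted

# Hypothetical ids: 464 stands in for a function word (excluded), 1234 for a noun (included).
logits = {464: 5.0, 1234: 4.0}
print(penalize_logits(logits, generated_ids=[464, 464, 1234], penalizable_ids={1234}))
# -> {464: 5.0, 1234: 3.0}  (only the content-word token is penalized)
```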
There is evidence that the human brain processes function words differently (Broca’s aphasia is a strong source of such evidence), though heuristics applied after the logits are returned certainly aren’t expected to behave like the human mind.
More practically, I wonder if penalizing extremely common function words like “the” can inadvertently make a text shorter: as ‘the’ gets pushed down by its own frequency, the relative probability of the end-of-generation token may overtake it.
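A toy softmax calculation (made-up logits, not real model values) illustrates the worry: pushing down the logit of a very frequent token redistributes probability mass to everything else, including the end-of-text token.

```python
# A toy illustration with made-up logits, not real model values.
import math

def softmax(logits):
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

logits = {"the": 4.0, "stars": 3.0, "<|endoftext|>": 1.0}
print(softmax(logits)["<|endoftext|>"])   # ~0.035

# After a cumulative frequency penalty of 2.0 on "the":
logits["the"] -= 2.0
print(softmax(logits)["<|endoftext|>"])   # ~0.090, roughly 2.5x more likely to stop
```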