I’m still a little confused about the difference between Frequency Penalty and Presence Penalty.
Is this a scaling thing, where presence penalty is a flat reduction if the token has appeared at least once before, while frequency penalty is bigger if the token has appeared multiple times?
Also, is there an easy way to implement a consecutive token penalty that scales with the number of tokens in a row that repeat an earlier sequence?
For instance, if I previously had “We the people of the United States” somewhere in the document, and the most recent 5 words are “We the people of the”, it should penalize ‘United’, and if it does select ‘United’, then the next token selection will even more harshly penalize ‘States’, and so on.
That seems like it would be more directly aimed at the issue I’m seeing a lot of (repeated long sequences of words), but it’s possible I’m just misunderstanding how Frequency and Presence work.
That level of granularity isn’t really possible with the built-in penalty parameters, but I’d try iteratively increasing the penalty value (e.g. +0.1 at a time) to see how it impacts repetition.
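If you do control the logits yourself (for example when running a model locally), the sequence-aware penalty described in the question could be sketched roughly like this. This is only an illustration under assumptions: `repeat_run_penalties`, `scale`, and `max_len` are hypothetical names, not part of any API, and whole words stand in for tokens to keep it readable.

```python
def repeat_run_penalties(context, scale=0.5, max_len=32):
    """Return {candidate_token: penalty} for the next sampling step.

    For every earlier position i, count how many tokens ending at i match
    the tokens ending at the current end of the context. The token that
    followed that earlier run (context[i + 1]) gets a penalty proportional
    to the length of the matched run.
    """
    penalties = {}
    n = len(context)
    for i in range(n - 1):
        k = 0
        while k < max_len and i - k >= 0 and context[i - k] == context[n - 1 - k]:
            k += 1
        if k > 0:
            candidate = context[i + 1]
            penalties[candidate] = max(penalties.get(candidate, 0.0), scale * k)
    return penalties


# Words stand in for tokens here (an assumption made for readability).
context = "We the people of the United States . We the people of the".split()
step1 = repeat_run_penalties(context)              # 'United' follows a 5-token match
step2 = repeat_run_penalties(context + ["United"])  # 'States' follows a 6-token match
```

Each returned penalty would then be subtracted from that candidate's logit before sampling, so the longer the repeated run gets, the harder its continuation is pushed down.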
Frequency_penalty and presence_penalty are two parameters that can be used when generating text with language models, such as GPT-3.
Frequency_penalty: This parameter discourages the model from repeating the same tokens too often. The penalty is subtracted from a token's logit once for every time that token has already appeared in the text so far, so it grows with the count: a token seen five times is penalized five times as much as a token seen once. A higher frequency_penalty value makes the model more reluctant to reuse tokens it has already used many times.
Presence_penalty: This parameter encourages the model to bring in new tokens. It is a flat, one-time amount subtracted from a token's logit if that token has appeared at least once in the text so far, no matter how many times. A higher presence_penalty value makes the model more likely to generate tokens that have not yet appeared in the text.
Both of these parameters can be adjusted to influence the overall quality and diversity of the generated text. The optimal values for these parameters may vary depending on the specific use case and desired output.
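To make the difference concrete, here is a minimal sketch of how the two penalties adjust logits, assuming the documented form: logit minus count times frequency_penalty, minus presence_penalty if the count is above zero. The function name `apply_penalties` and the dict-of-logits representation are illustrative assumptions, not the API's actual internals.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of `logits` (token id -> logit) with both penalties applied."""
    counts = Counter(generated_ids)
    adjusted = dict(logits)
    for token_id, count in counts.items():
        if token_id in adjusted:
            # Frequency penalty scales with the count; presence penalty is flat.
            adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted


logits = {0: 1.0, 1: 1.0, 2: 1.0}
generated = [1, 1, 1, 2]  # token 1 appeared three times, token 2 once
adjusted = apply_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=0.25)
```

In this example token 0 is untouched, token 2 loses 0.5 + 0.25, and token 1 loses 1.5 + 0.25, which is the "flat versus scaling" distinction asked about above.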