What I particularly want to know is what happens when top_p is set to zero. Will only the most probable token be selected, or will it produce the same outcome as setting top_p to null, i.e., be treated as invalid and ignored?
I’m aware that there are some discussions about setting top_p to zero in this forum, but is there any official information available anywhere?
What kind of short-circuit is currently in place for top_p: 0 (or for temperature: 0) is unknown and likely will not be discussed.
I did extensive trials when deterministic GPT-3 models from OpenAI existed.
In particular, I explored completions with near-identical top-2 logits (which, by the way, are hard to find), where temperature but not top_p can distinguish between them: you could get alternate answers.
Using a very small number instead of 0 selects only the first-ranked token, even when the logits appear identical to the nucleus-sampling stage. At that scale the nucleus cannot include more than one token, even if all 100k tokens had identical logprobs. Something where I know how it works is my choice.
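To illustrate why a tiny top_p can only ever admit one token, here is a toy sketch of standard nucleus (top-p) sampling. This is my own illustration of the general technique, not OpenAI's actual implementation, and the edge-case behavior at exactly 0 is just what this sketch happens to do:

```python
import numpy as np

def nucleus_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; mask out the rest and renormalize before sampling."""
    order = np.argsort(probs)[::-1]            # most probable first
    cumulative = np.cumsum(probs[order])
    # Include tokens until the cumulative mass first reaches top_p;
    # the +1 guarantees at least one token always survives.
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

# With a tiny top_p, only the single first-ranked token survives,
# no matter how close the runner-up probabilities are.
probs = np.array([0.40, 0.39, 0.12, 0.09])
print(nucleus_filter(probs, 1e-9))   # -> [1. 0. 0. 0.]
print(nucleus_filter(probs, 0.5))    # -> keeps only the top two tokens
```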
All models now have non-deterministic inference, and the actual top token can switch positions with another of similar likelihood regardless of the setting.
Thank you all for your responses!
I take it that there is no official statement about what exact value is applied when top_p is set to 0. I think it would be a good idea for me to set top_p to a small finite value and temperature to 0 when I want as deterministic a result as possible. Unlike top_p, the official documentation clearly states that temperature can range from 0 to 2. (I think this temperature works like the temperature in a softmax function, though no formula has been published for that either.)
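For reference, this is the softmax-with-temperature formula I have in mind. It is only my assumption about how the parameter behaves, since OpenAI has not published the exact formula; as temperature approaches 0 the distribution collapses onto the single highest logit, which is why temperature: 0 is usually treated as greedy decoding:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Assumed formula: p_i = exp(z_i / T) / sum_j exp(z_j / T).
    As T -> 0 the distribution becomes (nearly) one-hot on the argmax."""
    t = max(temperature, 1e-6)     # guard against division by zero at T = 0
    z = logits / t
    z = z - z.max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.9, 0.5])
print(softmax_with_temperature(logits, 1.0))   # mildly peaked distribution
print(softmax_with_temperature(logits, 0.01))  # essentially one-hot on the top logit
```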