A better explanation of "Top P"?

The tooltip for “Top P” in the Playground doesn’t really tell me what it does, qualitatively. I get that higher numbers allow more diverse options to be considered for outputs, but I don’t have a good sense of the practical differences.

Can someone provide some examples of when or why one would want it high or low? The tooltip example says 0.5 means it will consider half of the weighted options, but that doesn’t really explain what it means for any given use case.

Perhaps some examples of the same prompt with different outputs generated at 1, 0.75, 0.5, 0.25, and 0 so we can see why and when we’d want to use each level?

7 Likes

Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (a.k.a. the number of words in the set) can dynamically increase and decrease according to the next word’s probability distribution. Ok, that was very wordy, let’s visualize. [Source]

That page also covers Top K and other decoding methods, i.e. Beam, Greedy, and Sampling.

From what I’ve gathered, @Andrew and others have recommended not using top_p if you’re using temperature, but it’s possible to do. In my experience, the best way to learn has been to experiment and play around.

You can see some examples here of top_p vs greedy, etc.

:speaking_head:Context:
On days when he woke early, the president
liked to have coffee in the oval office. There was
something about watching the sky grow lighter and
lighter as you sat your pajama’d behind in the most
powerful chair in the free world, sipping marine strength
coffee, that you just couldn’t achieve anywhere else.

:moneybag:Greedy: He’d been in the oval office for over a year
now, and he’d never been so happy to be there. He’d
been in the oval office for almost a year, and he’d never
been so happy to be there.

:vertical_traffic_light:BeamSearch (k=4): The President was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word.

:spoon:Sampling (t=1.0): You couldn’t be sure if that’s what
you were really doing, and If you decided to take the day
off. The president wanted you to take the day off, but he
wanted to maintain a curfew and use his influence wisely.

🇰Top-k Sampling (k=8): “I’m sorry, sir.” “No, it’s okay,
son,” the president said. “I understand.” “You’re going
to have to make a special trip down there to get that kid.
He has no idea where he’s going.”

:atom_symbol:Nucleus Sampling (p=0.9): But that wasn’t what drew
the president’s attention. He’d been seated for maybe a
minute when he noticed the other man. What was the
guy doing here?

The way I think about it is that top_p keeps only the tokens whose probabilities add up to the given mass. So, a 0.8 top_p will gather all possible words that might come next but will choose from only those making up the top 80% of the probability mass.

Hope that helps.

5 Likes

Temperature controls randomness, so a low temperature is less random (deterministic), while a high temperature is more random.

More technically, a low temperature makes the model more confident in its top choices, while temperatures greater than 1 decrease confidence in its top choices. An even higher temperature corresponds to more uniform sampling (total randomness). A temperature of 0 is equivalent to argmax/max likelihood, or the highest probability token.
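That “more/less confident in its top choices” behavior is just the softmax with the logits divided by the temperature. Here’s a small sketch with invented logits (real models have thousands); note t = 0 isn’t computed this way in practice, it’s the argmax limit:

```python
# Sketch: temperature reshapes the softmax over raw logits.
import math

def softmax_with_temperature(logits, t):
    scaled = [x / t for x in logits]          # t < 1 sharpens, t > 1 flattens
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                      # made-up raw scores

cold = softmax_with_temperature(logits, 0.5)  # top token dominates
warm = softmax_with_temperature(logits, 2.0)  # closer to uniform

print(cold)
print(warm)
```

At t = 0.5 the top token takes almost all the mass; at t = 2.0 the same logits give a much flatter distribution, which is why high temperatures read as “more random.”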

top_p computes the cumulative probability distribution and cuts off as soon as that distribution exceeds the value of top_p. For example, a top_p of 0.3 means that only the tokens comprising the top 30% of probability mass are considered.
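That cutoff-then-renormalize step can be sketched in a few lines (toy numbers, not real model output, and the helper name is made up):

```python
# Sketch of the top_p cutoff: sort tokens by probability, keep them until
# the running total reaches top_p, then renormalize the survivors to sum to 1.

def nucleus_filter(probs, top_p):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:   # cut off once the mass exceeds top_p
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

probs = {"the": 0.5, "a": 0.25, "his": 0.15, "every": 0.07, "zebra": 0.03}
print(nucleus_filter(probs, 0.3))   # only "the" survives a 0.3 cutoff
```

With top_p = 0.3, the single most likely token already covers 30% of the mass, so it’s the only candidate left, matching the “top 30%” description above.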

7 Likes

Not in the playground, for now, but in code, you can set a temperature of up to 2 (which I’ve never found to be useful).

2 Likes

The guidance on not using both didn’t come from me. I’m all for pressing every button.

BTW: The HuggingFace explanation is directionally accurate but technically wrong. Nucleus sampling (Top P) is token-based, not word-based.

5 Likes

This is why I called in the expert. :slight_smile:

Great discussion!

Thank you all for the different insights.

So, what I’m getting is, top_p shrinks or grows the “pool” of available tokens to choose from, the domain to select over. 1=big pool, 0=small pool. Within that pool, each token has a probability of coming next.

Temperature is then a sort of fuzz-factor on rolling the dice to choose a next token from the available pool that top_p provided. temp=1 pushes the dice to consider less likely tokens, while temp=0 says to more strongly favor the already most likely tokens.

top_p circumscribes the domain of choice (allowing or excluding less likely options from playing at all), and temp says, within that domain, how strongly to stick with the most likely choice.

I’m being deliberately imprecise here, but is that the basic gist of how the two interact?
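For what it’s worth, OpenAI hasn’t published its exact sampling internals, but a common open-source ordering is temperature first (reshape the logits), then top_p (trim the tail of the reshaped distribution), then draw from the survivors. A rough, self-contained sketch of that “pool then dice” interaction, with made-up logits:

```python
# Rough sketch of one common sampling pipeline (order and details vary
# by implementation): temperature -> softmax -> top_p cutoff -> draw.
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    # 1. Temperature: rescale logits, then softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. top_p: smallest set of indices whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 3. Draw one index from the renormalized survivors.
    pool_mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * pool_mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

random.seed(0)
print(sample_next([3.0, 1.0, 0.5, -2.0], temperature=0.7, top_p=0.9))
```

With a tiny top_p the pool collapses to the single most likely token no matter the temperature, which matches the “small pool” intuition above.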

2 Likes