A better explanation of "Top P"?

Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (i.e., the number of words in the set) can dynamically grow and shrink according to the next word’s probability distribution. Ok, that was very wordy, let’s visualize. [Source]
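If it helps, here’s a minimal sketch of that selection step in plain Python (the token names and probabilities are made up for illustration):

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of words whose cumulative probability
    exceeds p, then redistribute the mass among that set."""
    # Rank candidate words by probability, highest first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        total += prob
        if total >= p:  # stop as soon as the "nucleus" covers p
            break
    # Renormalize so the kept probabilities sum to 1 again.
    return {word: prob / total for word, prob in kept}

# Toy next-word distribution (hypothetical values):
probs = {"coffee": 0.5, "tea": 0.3, "water": 0.15, "soup": 0.05}
print(top_p_filter(probs, p=0.9))  # keeps coffee, tea, water; renormalized
```

Note how the number of surviving words depends on the shape of the distribution: a peaked distribution might keep only one or two words, a flat one might keep dozens, which is exactly the dynamic behavior described above.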

That page covers Top-K and other decoding methods in more detail, i.e. Beam Search, Greedy, and Sampling.

From what I’ve gathered, @Andrew and others have recommended not using top_p if you’re using temperature, but it’s possible to do. In my experience, the best way to learn has been to experiment and play around.

You can see some examples here of top_p vs greedy, etc.

:speaking_head:Context:
On days when he woke early, the president
liked to have coffee in the oval office. There was
something about watching the sky grow lighter and
lighter as you sat your pajama’d behind in the most
powerful chair in the free world, sipping marine strength
coffee, that you just couldn’t achieve anywhere else.

:moneybag:Greedy: He’d been in the oval office for over a year
now, and he’d never been so happy to be there. He’d
been in the oval office for almost a year, and he’d never
been so happy to be there.

:vertical_traffic_light:BeamSearch (k=4): The President was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word.

:spoon:Sampling (t=1.0): You couldn’t be sure if that’s what
you were really doing, and If you decided to take the day
off. The president wanted you to take the day off, but he
wanted to maintain a curfew and use his influence wisely.

🇰Top-k Sampling (k=8): “I’m sorry, sir.” “No, it’s okay,
son,” the president said. “I understand.” “You’re going
to have to make a special trip down there to get that kid.
He has no idea where he’s going.”

:atom_symbol:Nucleus Sampling (p=0.9): But that wasn’t what drew
the president’s attention. He’d been seated for maybe a
minute when he noticed the other man. What was the
guy doing here?

The way I think about it is that top_p restricts sampling to the smallest set of words whose combined probability reaches p. So a 0.8 top_p will rank all the possible next words by probability and sample only from the most likely ones that together account for 80% of the probability mass; the long tail making up the remaining 20% is discarded. For example, with word probabilities 0.5, 0.3, 0.15, and 0.05, a top_p of 0.8 keeps only the first two words.

Hope that helps.
