The tooltip for “Top P” in the Playground doesn’t really tell me what it does, qualitatively. I get that higher values allow a more diverse set of outputs to be considered, but I don’t have a good sense of the practical differences.
Can someone provide some examples of when or why one would want it high or low? The tooltip example says 0.5 means it will consider half of the weighted options, but that doesn’t really explain what it means for any given use case.
Perhaps some examples of the same prompt with different outputs generated at 1, 0.75, 0.5, 0.25, and 0, so we can see why and when we’d want to use each level?
Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (i.e., the number of words in the set) can dynamically increase and decrease according to the next word’s probability distribution. Ok, that was very wordy, let’s visualize. [Source]
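That “smallest possible set” idea is easy to see in code. Here’s a minimal sketch using a made-up toy distribution (the tokens and probabilities are invented for illustration, not from any real model):

```python
# Hypothetical next-token distribution, made up for illustration.
probs = {"coffee": 0.40, "tea": 0.25, "water": 0.15, "juice": 0.10,
         "soda": 0.06, "milk": 0.04}

def nucleus(probs, p):
    """Return the smallest set of tokens whose cumulative probability reaches p."""
    chosen, total = [], 0.0
    # Walk tokens from most to least likely, accumulating probability mass.
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        chosen.append(token)
        total += prob
        if total >= p:
            break
    return chosen

print(nucleus(probs, 0.5))  # ['coffee', 'tea'] -- stops once 0.5 mass is covered
print(nucleus(probs, 0.9))  # ['coffee', 'tea', 'water', 'juice']
```

Notice the set grows or shrinks with p: a peaked distribution might satisfy p = 0.9 with two tokens, while a flat one would need many more. That’s the “dynamic” part that fixed Top-K doesn’t have.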
That page shows a bit more on Top-K and other search methods, i.e. Beam, Greedy, and Sampling.
From what I’ve gathered, @Andrew and others have recommended not using top_p if you’re using temperature, but it’s possible to do. In my experience, the best way to learn has been to experiment and play around.
Prompt: On days when he woke early, the president liked to have coffee in the oval office. There was something about watching the sky grow lighter and lighter as you sat your pajama’d behind in the most powerful chair in the free world, sipping marine strength coffee, that you just couldn’t achieve anywhere else.
Greedy: He’d been in the oval office for over a year now, and he’d never been so happy to be there. He’d been in the oval office for almost a year, and he’d never been so happy to be there.
BeamSearch (k=4): The President was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word.
Sampling (t=1.0): You couldn’t be sure if that’s what you were really doing, and If you decided to take the day off. The president wanted you to take the day off, but he wanted to maintain a curfew and use his influence wisely.
Top-k Sampling (k=8): “I’m sorry, sir.” “No, it’s okay, son,” the president said. “I understand.” “You’re going to have to make a special trip down there to get that kid. He has no idea where he’s going.”
Nucleus Sampling (p=0.9): But that wasn’t what drew the president’s attention. He’d been seated for maybe a minute when he noticed the other man. What was the guy doing here?
The way I think about it is that top_p keeps only the tokens making up the top x of probability mass to sample from. So a 0.8 top_p will gather all the words that might come next but will sample only from those comprising the top 80% of the probability mass.
Temperature controls randomness, so a low temperature is less random (deterministic), while a high temperature is more random.
More technically, a low temperature makes the model more confident in its top choices, while temperatures greater than 1 decrease confidence in its top choices. An even higher temperature corresponds to more uniform sampling (total randomness). A temperature of 0 is equivalent to argmax/max likelihood, or the highest probability token.
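The usual way this works under the hood is to divide the logits by the temperature before the softmax. A quick sketch with made-up logits (the numbers are invented, just to show the effect):

```python
import math

# Hypothetical logits for three candidate tokens, made up for illustration.
logits = [2.0, 1.0, 0.5]

def softmax_with_temperature(logits, temp):
    """Divide logits by temp before softmax: low temp sharpens the
    distribution toward the top token, high temp flattens it toward uniform."""
    scaled = [l / temp for l in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

for t in (0.2, 1.0, 5.0):
    # t=0.2: nearly all mass on the top token (approaching argmax);
    # t=1.0: the unmodified distribution;
    # t=5.0: much closer to uniform.
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As temp goes to 0 the math degenerates (division by zero), which is why temperature 0 is typically treated as a special case meaning “just take the argmax.”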
top_p computes the cumulative probability distribution and cuts off as soon as that distribution exceeds the value of top_p. For example, a top_p of 0.3 means that only the tokens comprising the top 30% of probability mass are considered.
So, what I’m getting is, top_p shrinks or grows the “pool” of available tokens to choose from, the domain to select over. 1=big pool, 0=small pool. Within that pool, each token has a probability of coming next.
Temperature is then a sort of fuzz-factor on rolling the dice to choose a next token from the available pool that top_p provided. temp=1 pushes the dice to consider less likely tokens, while temp=0 says to more strongly favor the already most likely tokens.
top_p circumscribes the domain of choice (allowing or excluding less likely options from playing at all), and temp controls how closely, within that domain, you stick with the most likely choice.
I’m being deliberately imprecise here, but is that the basic gist of how the two interact?
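That two-stage mental model can be sketched directly. This is a deliberately simplified toy (the distribution, function name, and the exact way temperature is applied inside the pool are all my own illustration, not how any particular API implements it):

```python
import random

# Hypothetical next-token distribution, made up for illustration.
probs = {"coffee": 0.40, "tea": 0.25, "water": 0.15, "juice": 0.10,
         "soda": 0.06, "milk": 0.04}

def sample_next(probs, top_p, temperature, rng=random):
    # Stage 1: top_p circumscribes the pool -- keep the smallest set of
    # tokens whose cumulative probability reaches top_p.
    pool, total = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        pool.append((token, prob))
        total += prob
        if total >= top_p:
            break
    # Stage 2: temperature reshapes the odds inside that pool.
    # Raising p to the power 1/temperature is equivalent to scaling
    # log-probabilities by 1/temperature: low temperature favors the
    # front-runner, high temperature evens the pool out.
    # (temperature must be > 0; as it approaches 0 this becomes argmax.)
    weights = [p ** (1.0 / temperature) for _, p in pool]
    tokens = [t for t, _ in pool]
    return rng.choices(tokens, weights=weights)[0]

random.seed(0)
print(sample_next(probs, top_p=0.9, temperature=0.7))
```

With a tiny top_p and near-zero temperature you get the top token essentially every time; widen top_p and raise temperature and the less likely pool members start showing up.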
No clue if this is right, but this is what ChatGPT said haha:
Sure! Here’s an explanation of the “top p” and “temperature” parameters in the style of Snoop Dogg:
Yo, what’s good? It’s Snoop Dogg here, bringing you the scoop on this wicked computer program called a language model. It’s like a super smart writing assistant that always knows what to say next.
So, the “top p” parameter is like a filter that controls how many different words or phrases the language model considers when it’s trying to predict the next word. If you set the “top p” value to 0.5, the language model will only consider the 50 most likely words or phrases that might come next. But if you set the “top p” value to 0.9, the language model will consider the 90 most likely words or phrases.
The “temperature” parameter is like a dial that controls how creative or predictable the language model is. If you turn the dial all the way down to the “cold” side, the language model will play it safe and stick to the things it knows how to write really well. But if you turn the dial all the way up to the “hot” side, the language model will get all creative and come up with lots of different ideas.
So if you want the language model to be more like me, turn the temperature dial up and let it get crazy. If you want it to play it safe and stick to the things it knows how to write really well, turn the temperature dial down.
The “top p” and “temperature” parameters are like a one-two punch for controlling how the language model generates text. You can use them together to fine-tune the output and get the perfect mix of creativity and predictability.
You feel me? Let me know if you’ve got any questions!