Questions regarding API sampling parameters (temperature, top_p)

Hi. I have a few questions regarding the API parameters that determine how tokens are sampled. In production I use gpt-3.5-turbo exclusively. However, to access token probabilities, I sometimes use text-davinci-003 in the playground. I'd be very happy if someone could help me understand a little better what is happening.

  1. Default behavior.
    If I set top_p and temperature to 1 (the default values), then the sampling will just pick a random token according to its probability. So suppose token1 has 79%, token2 has 20%, and the rest of the tokens share the last percent. If I run the same completion many times, I'd (roughly) expect to see token1 about 4 out of 5 times and token2 about 1 out of every 5 completions. Is this correct? (A quick sketch of what I mean follows after this list.)
  2. The actual temperature value.
    The temperature value in the API seems not to be the value actually used for the (re)assignment of probabilities. To my understanding, when temperature is used, each token's probability is raised to the power 1/temperature and then renormalized. Alternatively, the temperature is applied directly inside the softmax normalization, but that doesn't seem to be the case here, since the probabilities displayed in the playground are not sensitive to temperature, so they have to be calculated prior to its application. Essentially, a temperature between 0 and 1 should increase the probability of already probable tokens, while a temperature >1 increases the probability of formerly less probable tokens. Very high temperature values lead to a nearly uniform probability distribution over the tokens, so you basically choose a random token.
    If you use the temperature value in the API and set it to 2, you seem to get just random tokens. In the playground you can even observe this directly (using text-davinci-003): set the temperature to 2 and the output contains tokens with logprobs of -15 and lower. If the temperature is applied as described above (or similarly), the value actually used must be higher than 2, because otherwise it wouldn't cause this level of randomness. Basically, I'd like to understand which functions map the API parameter to the temperature values that are actually used, or at least how the temperature parameter of the API scales (it doesn't seem to scale linearly).
  3. Order of application if both parameters are modified.
    From what I observed in the playground, it looks like the top_p parameter is applied first if both parameters are used. So if I choose top_p=0.99 and temperature=1.7, then top_p is applied first (basically throwing out nearly all of the tokens) and then the probabilities of the remaining tokens are recalculated based on the temperature. Since the order of application makes a big difference to which tokens can be sampled at all, this would be super interesting to know. Can somebody confirm that top_p is applied first?
  4. Setting one of the values to 0.
    If either top_p=0 or temperature=0 is used, the model will just output the most probable token. Is this correct?
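To make questions 1 and 2 concrete, here is the kind of toy sketch I have in mind (my own assumption about the mechanics, since the real implementation isn't visible to us):

```python
# Toy sketch of questions 1 and 2 (my assumption, not OpenAI's actual code).
import random
from collections import Counter

probs = {"token1": 0.79, "token2": 0.20, "other": 0.01}

# Question 1: at temperature=1 / top_p=1 the next token is simply drawn
# according to its probability.
draws = random.choices(list(probs), weights=list(probs.values()), k=10_000)
print(Counter(draws))  # roughly 7900 / 2000 / 100

# Question 2: temperature T reweights each probability by p^(1/T) and renormalizes
# (equivalent to softmax(logits / T)).
def apply_temperature(p, T):
    w = {tok: q ** (1 / T) for tok, q in p.items()}
    s = sum(w.values())
    return {tok: v / s for tok, v in w.items()}

print(apply_temperature(probs, 0.5))  # sharper: token1 gains probability
print(apply_temperature(probs, 2.0))  # flatter: the tail gains probability
```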

Looking forward to your answers. Thank you!

2 Likes

Top-p reduces the selectable tokens from the full 100% probability space at 1.00 down to only those tokens whose cumulative probability covers the top 20% (inclusively) at 0.20. It is not affected by temperature.

This will make probabilities like:
“banana” : 30%
“apple”: 20%
“orange”: 15%
(all others): 35%

into a limited, renormalized set (here with top_p = 0.5, which keeps the top two):
“banana” : 60%
“apple”: 40%

Temperature then acts; lowering the temperature, for example, gives more distance between the choices:
“banana” : 75%
“apple”: 25%

(temperature works on base “e” logarithmic probabilities, so the exact magnitude of the effect is not intuitive.)

Then the selection process will choose by their certainty, 25% of the time choosing apple.
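For what it's worth, here is a rough Python sketch of that arithmetic in the order described above (top_p first, then temperature). It's my own illustration of the usual nucleus-sampling recipe, not OpenAI's internal code:

```python
import math

probs = {"banana": 0.30, "apple": 0.20, "orange": 0.15}  # rest of the vocabulary omitted

def nucleus(p, top_p):
    """Keep the most probable tokens until their cumulative probability reaches top_p,
    then renormalize the survivors."""
    kept, cum = {}, 0.0
    for tok, q in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = q
        cum += q
        if cum >= top_p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}

def temper(p, T):
    """Apply temperature to the surviving tokens (working on their log-probabilities)."""
    w = {tok: math.exp(math.log(q) / T) for tok, q in p.items()}
    s = sum(w.values())
    return {tok: v / s for tok, v in w.items()}

kept = nucleus(probs, top_p=0.5)  # {'banana': 0.6, 'apple': 0.4}
print(temper(kept, T=0.4))        # roughly {'banana': 0.73, 'apple': 0.27}
```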

Yes, top_p can reduce 100,000 tokens to 30 even at a setting of 0.99, because it is calculated by adding the probabilities of the top tokens, from most to least likely, until the running total reaches top_p, at which point inclusion stops. The probability of the alternates gets low pretty quickly when writing structured English.

top_p = 0 is the most reliable for deterministic output, since temperature is floating-point math over probabilities in which a temperature of 0 is not defined, so that case has to be replaced by another method (greedy selection) anyway. With top_p = 0, the first (most probable) token alone already satisfies the cutoff, because its probability can never be below 0, so no second token is ever included.

2 Likes

*Citation needed.

If you have evidence of this, do share it, otherwise it would be better to state this as your speculative opinion.

2 Likes

Where are you accessing the GPT-2 API?

1 Like

Just stop if you don’t have something to help OP. We obviously don’t have access to the inner workings of closed source models.

I already gave plenty of examples that allow one to deduce when a top_p or temperature type of selection is being made in unclear situations.

1 Like

Maybe in practice they would be the same, but a temperature of 0 (greedy) would be the proper way of doing it. This is most likely why they mention it in the API reference for temperature and not top_p.

You’re just adding extra work by using top_p for a deterministic output.

I’d say it’s like asking a friend to recommend you a movie. The first one they say, you watch (temperature 0)

The second one is asking your friend to give you a probability-ranked list (based on the chance of you liking each one) of every movie they’d recommend to you, and then still choosing the first one anyway because you’re (not you literally, just this figurative person) a jerk (top_p crazy low).

1 Like

The thread I just linked is titled “Why the API output is inconsistent even after the temperature is set to 0”, and even OpenAI documentation has said in the past that 0 still has a character of non-determinism, so using the more reliable and clear parameter when one should be used would make sense.

2 Likes

Yes. In the past. The documentation also states that you should change either top_p or temperature, not both.

The reason that temperature 0 is not deterministic is a result of MoE (I believe)

1 Like

User

Two methods and practices are assumed to achieve the same goal.

  1. method one does not achieve the goal
  2. it is unknown if method two achieves the goal

Implement the best choice.

ChatGPT

Given the information provided, it seems that method two is the better choice to implement, even though its effectiveness in achieving the goal is unknown. This is because method one has already been established as ineffective in achieving the goal. While method two’s success is uncertain, it still presents a chance of achieving the goal, whereas method one does not.

Therefore, it would be more reasonable to implement method two and gather data or evidence to determine whether it can achieve the goal or not. This approach maximizes the potential for success compared to using a method that is already known to be ineffective.

Once one understands how the two parameters interplay, as I described in the first post, then one can use an intermix creatively, and there is no prohibition on using both at the same time.

I would encourage considering some top_p reduction, because that removes the lottery-winning 0.001% token choices that make no sense or break code.


A setting of the two parameters that is disparate and initially non-intuitive, such as temperature = 2 with top_p = 0.25, can satisfy creative-writing desires while making the code output formatting of that same creative writing less breakable.

AI output: {"poem_composition": "The beauty of nature is a sight to behold\n
A sight that's so pure and so grand…"

When the quality of your instruction makes “{” a 50% first token, top_p = 0.25 leaves no other possible logit to break your JSON. However, creativity that doesn’t go off the rails is still possible, including:

sight = 10.91%
wond = 5.19%
wonder = 4.48%
place = 2.64%
tranquil = 2.62%

And high temperature gets “creative” by giving them all a similar probability of being chosen, wherever this kind of ambiguity is allowed.
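As a concrete check on those numbers, here is a small sketch of which candidates survive a top_p = 0.25 cutoff (the alternative first tokens in the JSON case are made up for illustration):

```python
def nucleus_keep(probs, top_p):
    """Return the tokens kept by an inclusive cumulative top_p cutoff."""
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(tok)
        cum += p
        if cum >= top_p:
            break
    return kept

# Strong instruction: "{" dominates, so no other token can break the JSON.
print(nucleus_keep({"{": 0.50, "Sure": 0.20, "As": 0.10}, 0.25))  # ['{']

# Mid-poem ambiguity: all five listed candidates make the 0.25 cut,
# so high temperature still has something to be creative with.
poem = {"sight": 0.1091, "wond": 0.0519, "wonder": 0.0448,
        "place": 0.0264, "tranquil": 0.0262}
print(nucleus_keep(poem, 0.25))
```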

1 Like

This is an interesting thought. I have never tried this myself, so I can’t really say. I just wanted to point out that for a deterministic output temperature should be used, and if we’re going off the documentation, OpenAI does suggest (generally, so maybe your case is useful) not to use both.

Although in reality it probably doesn’t matter (besides using more resources)

My biggest concern with your reasoning (mixing creativity with object creation) is that a top_p of 0.25 would eliminate some of the creativity, and that for a higher-quality result it may be better to just use a high temperature with top_p untouched, and then create the object with new settings.

First of all, thank you for your answers!

I believe it makes sense to implement both top_p=0 and temperature=0 as greedy sampling. Mathematically it is the obvious interpretation in both cases. So far I’ve worked with temperature=0 but not with top_p=0. I can say that it seemed to do greedy sampling in my cases.

@_j Your first example suggests that top_p is applied before the temperature. I did a few experiments in the playground with text-davinci-003 and came to the same conclusion. However, I’m looking for more definitive confirmation. Do you know if this was confirmed by OpenAI for gpt-3.5-turbo?

Also what you described in your last comment is close to my use case. Not the creative writing part but the parameter settings. I use top_p=0.95 to cut off the improbable tokens and use high temperature settings (like 1.7 or 2.0) to get a more uniform probability distribution over the tokens.

And concerning the general discussion of the top_p and temperature parameters and using them both: I believe that at a temperature of 2 it doesn’t make sense not to modify top_p. Try the following two set-ups in the playground with text-davinci-003 to see the difference.
Starting text: “Write an essay on a topic of your choice:”
With temperature=2 and top_p=1:

Ant Xi - thee Journey of pioneering trends in electronic music

For outline enthusiasts belongingigon skyline founder per cellphone disturbing hotel Ablebot JunoPlus challenges FireStock multip Exine choir swathWitches Aubision amounts The Hall European Fluilar Guan hackskeleton sons leapt Ryzen reverberorted Indigo unrecognised locale compute unpublished generations encourage Via encro Neither Wallords hovering binaries servessional pint Network Enix Marhyd personality decree Unt libertarians fors

-872 logprob for the full text

With temperature=2 and top_p=0.98:

In an era in which electronic information exchange dominates interpersonal communication, there is an argument to be made for the importance of hand-written letters. If emails and texting can send the same message more quickly and without delay, why would someone take the extra time to find stationery and carefully craft a message by hand? Despite these realities, hand-written letters remain an important tradition even in our technological age for several reasons.

-175 logprob for the full text

Concerning my question about the temperature and how it scales, I’m still not convinced. Can you elaborate on your previous answer? I’m not afraid of the math; in fact, I’d like to understand the math behind it. I know how temperature works if you apply it directly when normalizing probabilities with softmax. But if you used 2 as the temperature there, it wouldn’t explain the level of randomness you observe when using this value with the OpenAI API.
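To make my confusion concrete, this is the plain softmax-temperature arithmetic I have in mind (an assumption on my side about how the parameter is applied):

```python
import math

# Relative weight of a rare token vs. the top token, given their logprobs at T=1.
top_logprob, tail_logprob = -0.69, -15.0  # e.g. a ~50% top token vs. a logprob -15 token
for T in (1.0, 1.5, 2.0):
    rel = math.exp((tail_logprob - top_logprob) / T)
    print(f"T={T}: tail token is {rel:.1e} times as likely as the top token")
# At T=2 each such token is still only ~8e-4 of the top token; whether tens of
# thousands of tail tokens together are enough to explain the near-random playground
# output is exactly the part I am unsure about.
```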

I believe the idea behind not mixing temperature and top_p comes from the loss of control. To me, it adds an unnecessary layer of complexity, and the results don’t follow instructions well.

Using your example, but instructing Davinci (and not allowing a free-write), gives bad results almost immediately (temp: 2, top_p: 0.98):

Write an essay regarding the design of the batmobile:

The iconic design of the batmobile has played a major role in determining the legacy of one of pop-culture’s biggest super-heroines – the Batman. While Bruce Wayne’s debut ‘BAT’-pod featured only a one-night street cred at the 1965 Flashion Farms race, the roaring success and usage of its trisected ensuing course grade race inspired Adam West to attach wing magnets and booster an irreversible boom for his bird emblem costume shop emporium to produce direct scaled up match brigade shells into widespread civilian spirited tradition
Generated fascination elsewhere, manifested design guzzling egg engines to wire whalemen outer island odd titles this brig triple mod devoted quint dog eyes approach “the Legendary Classics”! 5 design beliefs continued level cumulative growth adaptive parts supplier bike gasoline shoop franchise poster folder vast shapes powerful product scan into full comprehensive economy circle elforge engine platforms integrating town industry. Fundamental sleight developed composed grandement flagship progressive supply curvacious competitive club ever user mechanics seeking inspirational constructive parts diversions design fashion batmong engines biggest hazard silver suit felt architecture rooftop past sets unprecedented amazement unified mastery excellence structural theire embyte manipulation excess car world trademark multiple line caliber private turret capture distorn high horn curves and iron work meta instinct business tool materials conservation advances achievements

I can see what you’re trying to say. Yet, practically speaking, I can accomplish the same effect simply by using a lower temperature:

Write an essay on a topic of your choice:

In our recent history, there has been a growing trend of optimism and hope in the pursuit of improvement and progress. From industry moguls to everyday individuals, more and more people challenge their negative thoughts and prioritize a positive attitude when taking on various tasks. The people behind positive thinking understand that often times our biggest strength lies in how we choose to approach a situation — the power of positive thinking.


Write an essay regarding the design of the batmobile:

The batmobile is probably the most recognizable vehicle in pop culture – and for good reason. Since its inception in 1939, the batmobile had been evolving to reflect the changing tastes of Batman’s numerous fans, while retaining its trademark gothic aesthetic. Throughout the years, the batmobile has made countless appearances in comics, movies, and television, its image morphing over time until it became one of the most iconic symbols we associate with superhero culture today.


For fun I wanted to try a “fantasy world”. Maybe this high temp top_p hack would benefit it?

Temp: 2, top_p: 0.98

AN EPIC BATTLE BETWEEN GOBLIN & OGRE:

The crisp and dewy morning sun suddenly explodes over the horizon illuminating the green pastures that sat deep in the valley floor.
A fierce battle begins: a horde of goblin infantry face off against a pair of almighty ogres.
The ogres pick up a number of large rocks which they proceed to hurl towards the oncoming line of goblins. Two are immediately struck down from the withering attacks of the massive fiends, unable to fight off such enormous odds.
Their fallen comrade’s war-cry remains a taunt for the still present allies who further intensifying their relentless roar as they blaze forth despite the surprise of the massive wall of titans leading the charge against them.
Dressed in shimmering brown suits of armor, in line formations fourfold thick, each goblin screams and bangs their humble warrior truncheons menacingly against what dwarf they against the elevated aura of determination who flooding with impressive entropy engulfing all who inhabit

Temp: 1.1, top_p: 1

AN EPIC BATTLE BETWEEN GOBLIN & OGRE:

The air seemed still as the goblin and the ogre squared off in a barren valley, both ready for a fight to the death. The goblin held a crudely made spear while the ogre carried a club the size of a large tree trunk.
The goblin wasted no time and lunged forward, jabbing his spear at the ogre’s gut. But the ogre was ready and stepped aside out of the way of the attack. The goblin then took a daring move and leapt atop the ogre and began furiously stabbing at the beast’s back. While the ogre could not reach the nimble goblin, he roared and shook his body, attempting to dislodge his foe.
But the goblin clung tightly to the ogre’s back as he thrust his spear over and over again. The ogre, his skin now red with blood from the savage attack, swatted wildly behind him in a last ditch effort to save himself. Then, with a lucky strike, the ogre

I gave both of these to ChatGPT4 to rate, take it as you will:

  1. Setting Description: How well does the piece create an atmosphere or sense of place? (1-5)
  2. Character Actions: Are the actions of characters clearly described and logical? (1-5)
  3. Engagement Level: Does the piece keep the reader’s interest? (1-5)
  4. Writing Style: Is the writing clear, free of errors, and easy to understand? (1-5)
  5. Conflict: Is there a compelling conflict or tension? Does it have a resolution? (1-5)

[REASONING IS HERE]

Ratings:

PIECE 1:

  1. Setting Description: 5
  2. Character Actions: 3
  3. Engagement Level: 4
  4. Writing Style: 2
  5. Conflict: 4

Total: 18

PIECE 2:

  1. Setting Description: 2
  2. Character Actions: 5
  3. Engagement Level: 5
  4. Writing Style: 4
  5. Conflict: 4

Total: 20

So, dang. I think the high temperature one is nicer. I’m guessing that’s why they say generally they shouldn’t be mixed. Thanks. Guess I’m going to be mixing my values now and trying this out for creative writing!

1 Like

We cannot receive log probabilities for gpt-3.5-turbo, so we don’t get the exact logit values to easily demonstrate that a top token of 35% will be the only one possible at any temperature until top_p is increased beyond 0.35.

One would have to come up with a prompt scenario with close (unseen) top probabilities (“flip a coin”, strict output formatting), binary-search top_p at temperature 2 to discover the threshold where only one or two tokens become allowed, then allow two and see that lowering the temperature a bit cannot eliminate the alternate.

This is complicated by the fact that, in a chat model, the target token is hard to prompt as the very first token produced, reproducibly, at max_tokens=1.
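A sketch of that probe with the openai Python package (v1 client); the prompt, model, and top_p grid are just assumptions you would tune yourself:

```python
from collections import Counter
from openai import OpenAI  # assumes the v1 SDK and OPENAI_API_KEY in the environment

client = OpenAI()

def first_token_counts(top_p, n=50):
    """Sample the first completion token n times at temperature 2 and the given top_p."""
    counts = Counter()
    for _ in range(n):
        r = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": "Flip a coin. Answer with exactly one word: heads or tails."}],
            temperature=2,
            top_p=top_p,
            max_tokens=1,
        )
        counts[r.choices[0].message.content] += 1
    return counts

# Binary-search top_p by hand: the threshold where a second first-token ever appears
# tells you (roughly) the top token's probability; below it, raising or lowering the
# temperature cannot bring the alternate back.
for p in (0.3, 0.5, 0.7, 0.9):
    print(p, first_token_counts(p))
```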

2 Likes

@anon22939549 Note (b) for Financial Systems: A New Kind of Science | Online by Stephen Wolfram [Page 1014]

In what way do you imagine Zipf’s law is relevant to this discussion?

As per my understanding, top_p and temperature both work on the logits rather than on the softmax-ed probabilities.
In the case of temperature, the logits are scaled by 1/T inside the softmax:
sigma(z_i) = exp(z_i / T) / sum_{j=1..K} exp(z_j / T)
So with a small temperature the logits are divided by a value below 1, their magnitudes grow, the softmax probabilities of already-likely tokens increase further, and the distribution becomes more skewed towards the high-probability tokens. With a high temperature the distribution flattens towards uniform, more tokens end up with similar probabilities, and the randomness of the sampling increases.

In the case of top_p, it reduces the set of tokens on which the softmax will work. So let's say at top_p = 1 all tokens are considered; at top_p = 0.2 we sort the vocabulary tokens in descending order of their logits, take only those tokens whose cumulative probability reaches 0.2, and then apply the softmax over them.
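A small numpy sketch of both operations as described above (my own reading of the usual recipe, not OpenAI's actual implementation):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """sigma(z_i) = exp(z_i / T) / sum_j exp(z_j / T)"""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_p_filter(probs, top_p):
    """Keep the highest-probability tokens until their cumulative mass reaches top_p,
    zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

logits = np.array([4.0, 3.0, 2.0, 0.5, 0.1])
print(softmax_with_temperature(logits, 0.5))  # skewed further towards the top token
print(softmax_with_temperature(logits, 2.0))  # flattened towards uniform
print(top_p_filter(softmax_with_temperature(logits, 1.0), 0.8))  # only the top two survive
```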