The Relationship between Best of, Temperature and Top P (The Three Variable Problem)

I’m trying to get a better understanding of “best of”. Although I suspect it can be helpful in some cases, I’m not sure whether this is actually true or not, and was hoping someone who understands it better can help explain.

The part I can’t get my head around is the difference between using BO to find the most probable of x iterations and just running the prompt with temperature set at zero. Is “temp = 0” not just another way of saying return the response with the highest log probability?

I did some experimenting with davinci using limericks, since they utilize structure while being notoriously difficult to compose. Here are my results. The three numbers before each completion are temp, top p, and BO. Some experiments were repeated more than once.

Prompt:
Limerick about limericks: The limerick packs laughs anatomical
Into space that is quite economical.
But the good ones I’ve seen
So seldom are clean
And the clean ones so seldom are comical.

Limerick about a couple: There was a young lady of Norway
Who hung by her toes in a doorway.
She said to her beau
Just look at me Joe,
I think I’ve discovered one more way.’

Limerick about food: My favorite food is baloney
Though some people think it is phony
Despite what they say

0 1 1
It’s still the best way
To make a sandwich, in my opinion!

0 1 2
It’s still the best way
To make a sandwich, in my opinion!

0 .5 1
It’s still the best way
To make a sandwich, don’t you agree?

.7 1 1
It’s still the best way
To start off each and every day!

It’s still good to eat
And I’ll continue to eat it with glee.

It’s still the best way
To make a sandwich, with or without mayo!

It tastes great any day
And I’ll eat it as long as I live

It tastes great any day
And I’ll eat it as long as I live

.7 1 5

It’s still yummy today
And I’ll still eat it with bolognaise.

It’s still yummy to me
And I’ll eat it every day if I want to!

.7 .5 3
It’s still the best way
To make a sandwich, don’t you agree?

It’s still the best way
To make a sandwich, don’t you agree?

It’s still the best way
To make a sandwich, in my opinion!

The 0 1 2 experiment was just a sanity check to make sure I wouldn’t see variation when temp was set to 0. After setting the BO to a rather high (and expensive) value of 5, I thought it was interesting to see the word yummy show up twice, even though it didn’t show up in the .7 1 1 round. (I do like the word “yummy” in this context, and I could probably be convinced it’s a “better” word, although the completion lines weren’t particularly noteworthy.)

Interestingly, dialing top_p down to .5 with a BO of 3 resulted in two iterations of the 0 .5 1 result, and one iteration of the 0 1 1 result. Can anyone help explain why this would be true?

Non-technical opinions and observations as to whether a non-zero temp and BO > 1 can generate better results than temp = 0 and BO = 1 are welcome!


Along similar lines, if “best of” simply means “most probable of”, then it’s not really “best of”, is it? “Best of” implies some external yardstick of quality.


@nimblebooksllc Indeed!

@PaulBellow any chance you could shed some light on this question?

I too am wondering how to use best_of best. Although I thought it would be related to n.

best_of tells GPT to generate multiple responses at its end

Once it has done this, it ranks them based on quality and returns the number you have in the n setting

So best_of will always be more than (or the same as) n

It is really there so you can set a high temperature and the AI can go crazy, and then get it to pick the best score (I assume it compares the embedding vector of the prompt with the embedding vectors of the completions and takes the highest value).

As far as I can tell, you get billed for the best_of tokens (even though you don’t see the responses). I could be wrong there

So don’t use best_of=10 and n=1 unless you understand this

Edit: Confirmed that you are billed for the best_of tokens. Make sure you set a stop value and/or max_tokens. But max_tokens will end up being multiplied by best_of
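
To put a rough number on the billing point, here is a minimal sketch of the worst case. The function name is hypothetical, and it assumes (as described above) that every one of the best_of candidates can run up to max_tokens and that best_of is at least n; a stop sequence would usually end completions earlier.

```python
def worst_case_completion_tokens(max_tokens: int, best_of: int = 1, n: int = 1) -> int:
    """Upper bound on generated tokens for one request (illustrative only)."""
    if best_of < n:
        # Assumption taken from the post above: best_of is never less than n.
        raise ValueError("best_of must be at least n")
    # Every candidate is generated server-side, even those never returned.
    return max_tokens * best_of


# best_of=10 with n=1 can generate up to ten times the tokens you see.
print(worst_case_completion_tokens(max_tokens=64, best_of=10, n=1))
```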


I concur with @raymonddavey’s knowledgeable answer.

I personally don’t use it currently.


Quality based on what measure? This is the root of my question. If it’s quality based on likelihood, then you can achieve the same thing with a zero temp. I haven’t seen an answer yet for what quality means in objective terms.
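
One possible reason the two are not the same thing: zero temp picks the single most likely token at each step, and that step-by-step choice is not guaranteed to produce the most likely completion overall. The sketch below uses a made-up two-step "model" (TOY_MODEL and its probabilities are invented for illustration, not taken from davinci) to show the gap. Whether it matters in practice for a real model is a separate question.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "D": 0.4},
    ("A",): {"B": 0.5, "C": 0.5},
    ("D",): {"E": 1.0},
}


def greedy(model):
    """Always take the most likely next token (what temp = 0 is assumed to do)."""
    prefix, logprob = (), 0.0
    while prefix in model:
        token, p = max(model[prefix].items(), key=lambda kv: kv[1])
        prefix += (token,)
        logprob += math.log(p)
    return prefix, logprob


def all_sequences(model, prefix=(), logprob=0.0):
    """Enumerate every complete sequence with its total logprob."""
    if prefix not in model:
        yield prefix, logprob
        return
    for token, p in model[prefix].items():
        yield from all_sequences(model, prefix + (token,), logprob + math.log(p))


greedy_seq, greedy_lp = greedy(TOY_MODEL)
best_seq, best_lp = max(all_sequences(TOY_MODEL), key=lambda s: s[1])

print("greedy:", greedy_seq, round(math.exp(greedy_lp), 2))
print("most probable:", best_seq, round(math.exp(best_lp), 2))
```

Here the greedy path starts with the likelier first token but ends on a less likely completion overall. Sampling several candidates at a non-zero temp and keeping the highest-scoring one can sometimes find the second kind of completion, which zero temp alone never will.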

best_of

Generates best_of completions server-side and returns the “best” (the one with the highest log probability per token). Results cannot be streamed.

When used with n, best_of controls the number of candidate completions and n specifies how many to return – best_of must be greater than n.

Note: Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for max_tokens and stop.

I note that “best” is in quotes, but they say it’s the highest log prob per token…

While for ‘n’ there’s no mention of sorting, so maybe it’s random? And that would be the difference?

Although I agree that best_of = 1 and n = 1 would be the same as temperature = 0 and n = 1…

But I could be wrong!

n

How many completions to generate for each prompt.

Note: Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for max_tokens and stop.

Thinking about it more, I can see how it might be confusing.

https://beta.openai.com/docs/api-reference/completions/create#completions/create-n


Without seeing the openai source code for the engine, this is my best guess:

We already know that each token that makes up a completion has a logprobs value. This is on a log scale (the numbers are normally less than zero, i.e. negative, and the closer to zero they are, the higher the score or probability).

We can see the top 5 token log scores for any completion.

Fun fact: If you take a logprobs score and use its value as an exponent to e (e.g. e to the power of the logprob score), you get a probability, which you can read as a percentage. Logprob scores closer to zero give higher percentages
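
A quick way to check the conversion yourself, as a minimal sketch (the helper name is made up, and the sample logprob values are arbitrary rather than taken from a real completion):

```python
import math


def logprob_to_percent(logprob: float) -> float:
    """Convert a natural-log probability to a percentage."""
    return math.exp(logprob) * 100.0


# Values closer to zero map to higher percentages.
for lp in (-0.01, -0.7, -3.0):
    print(f"{lp:>6}: {logprob_to_percent(lp):.1f}%")
```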

Anyway, now we know that, if we take the logprob score for each token that makes up the first completion and do something with them (I assume multiply them with each other), we get a final score for the entire completion

Then if we do it again for the next completion, we get another score for the second completion

When we have done best_of completions, we can sort by the overall scores for each completion and take the values closest to zero

Because they have the best “overall” logprobs value

Quick thought: Because we are dealing with negative numbers, maybe it multiplies by the absolute value. However, that is beside the point because it basically does some form of calculation on the logprob score of each token to get a final score for the entire completion

And then it compares them to each other to pick the best ones

So that’s my best guess. I think I will give that explanation a logprob score of -0.01 :slight_smile: or e^-0.01 ≈ 99%


I mostly didn’t understand that. But that’s on me. I may have to relearn some maths (or math to those in the US)


First it’s important to recall (or clarify) a few things:

  • Completions are selected by adding tokens one at a time based on their computed likelihoods (expressed as logprobs). Crucially, this is where the randomness of generated completions comes in: token selection is not deterministic (even though the logprobs for a given prompt are).
  • Despite being fixed for a given prompt and output token, logprobs may vary slightly between runs, because small numerical inaccuracies accumulate in the many floating-point operations involved in computing them. This is not where the randomness of completions comes from.
  • When n or best_of are provided as arguments you’re effectively invoking the API max(best_of, n) times (with n and best_of omitted), to generate max(best_of, n) completions, from which n are selected.

With that in mind, here’s how it works: The n selected are simply the ones for which the sum of all the logprobs is highest, i.e. the most probable overall completions (not the ones with the highest logprob “per token”, as stated in the docs, whatever that would mean).
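
The difference between the two readings can be sketched as follows. This is only an illustration with invented candidates and logprobs; it does not confirm which scoring the API actually uses. It just shows that "sum of logprobs" and "mean logprob per token" can rank the same candidates differently, mostly because a sum penalizes longer completions.

```python
def total_logprob(token_logprobs):
    """Score used in the post above: sum over all tokens."""
    return sum(token_logprobs)


def mean_logprob(token_logprobs):
    """One reading of the docs' wording: average logprob per token."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_best(candidates, n, score=total_logprob):
    """Return the n highest-scoring candidates (hypothetical helper)."""
    ranked = sorted(candidates, key=lambda c: score(c["token_logprobs"]), reverse=True)
    return ranked[:n]


# Made-up candidates standing in for best_of = 3 sampled completions.
candidates = [
    {"text": "short", "token_logprobs": [-0.9, -0.9]},
    {"text": "longer", "token_logprobs": [-0.5, -0.5, -0.5, -0.5]},
    {"text": "unlikely", "token_logprobs": [-2.0, -1.5, -1.0]},
]

print(pick_best(candidates, n=1)[0]["text"])
print(pick_best(candidates, n=1, score=mean_logprob)[0]["text"])
```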

Where temperature comes in is simply that raising it flattens the probability distribution for output tokens, making less likely tokens more likely to be selected at the expense of more likely ones.
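
As a minimal sketch of that flattening, assuming the usual approach of dividing the logprobs by the temperature and renormalizing (the function name and the three-token distribution are made up for illustration):

```python
import math


def apply_temperature(logprobs, temperature):
    """Rescale logprobs by 1/temperature and renormalize to probabilities."""
    if temperature <= 0:
        # Temperature 0 is normally handled separately, as "take the top token".
        raise ValueError("temperature must be positive in this sketch")
    scaled = [lp / temperature for lp in logprobs]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.7, 0.2, 0.1]  # invented next-token probabilities
logprobs = [math.log(p) for p in base]

flat = apply_temperature(logprobs, 2.0)   # higher temp: flatter distribution
sharp = apply_temperature(logprobs, 0.5)  # lower temp: more peaked

print([round(p, 3) for p in flat])
print([round(p, 3) for p in sharp])
```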

And top_p is just an alternate way of tweaking the token probabilities, by grabbing only the most likely tokens whose total probability adds up to at least p, and then choosing from those (using a renormalized PDF just over that subset).
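
A rough sketch of that filtering step, with a made-up next-token distribution (the token strings and probabilities are invented, and the helper name is hypothetical):

```python
def top_p_filter(probs, p):
    """Keep the most likely tokens until their total reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


probs = {"best": 0.45, "yummy": 0.25, "good": 0.2, "great": 0.1}

print(top_p_filter(probs, 0.5))  # only the top two tokens survive
print(top_p_filter(probs, 1.0))  # nothing is cut
```

With top_p at .5 only a couple of options remain at each step, which may be why the .7 .5 3 runs earlier in the thread kept landing on the same one or two completions.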

(Also, note — as the close reader will have discovered — that, simply as a convenience for human readability, the logprobs reported by the API are for words, not tokens.)
