Logprobs inconsistent between runs for 4o

If I send the exact same request multiple times (asking the model to reply with “Yes” or “No”) and request logprobs, I get wildly different answers on each request with 4o, but not with 4 or 4-turbo. Any idea why?

{
    "model": "gpt-4o",
    "messages": [...],
    "temperature": 0,
    "top_p": 1,
    "logprobs": True,
    "top_logprobs": 2,
    "functions": None,
    "function_call": None,
}

All AI models that OpenAI currently runs are non-deterministic: give them the same input, and they can return different logprob values each time.

This variation is even higher in the newest models. You can specify either 0 or minuscule values for top_p and temperature and still get answers that diverge pretty quickly.

Temperature 0 is, for some reason, not as effective as temperature 0.0000000001. top_p at an extremely low value is a stronger way of enforcing that only the top-ranked token comes back.

The overall answer, though, is perplexity. The less expensive AI is less clear and certain about how to score tokens (unless a particular post-training chat behavior takes over), so the logprob values end up closer together, and it is easy for one to overtake another between runs.

Asking for logprobs doesn’t change the behavior. It does let you see how close “Yes” was to “No”, though, or to “yes” or “I’m sorry”. That can give you the insight that you need to make your prompting and desired output clearer, or that the AI simply has no good truthfulness score to give you from the facts.
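
Here’s a minimal sketch of inspecting those logprobs, assuming the current openai Python SDK (the model name and the Yes/No prompt are just placeholders). It requests the top two candidates for the first generated token, so you can see how close “Yes” sits to “No” from run to run:

import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Is the sky blue? Answer Yes or No."}],
    temperature=0,
    top_p=1,
    logprobs=True,
    top_logprobs=2,
    max_tokens=1,
)

# The first generated token, with its two highest-ranked alternatives
first_token = response.choices[0].logprobs.content[0]
for candidate in first_token.top_logprobs:
    # logprob is the natural log of the probability; exp() recovers the probability
    print(f"{candidate.token!r}: logprob={candidate.logprob:.4f}, p={math.exp(candidate.logprob):.3f}")

Run it a few times: when the two probabilities sit close together, that is exactly the situation where the winner can flip between runs.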


Thanks for the answer, interesting to read. If I understand you correctly, setting top_p to a very small number might cause more deterministic behaviour than setting it to 1 as in my current setup?


In top_p, the “p” stands for a probability mass cutoff applied to the ranked set of all logits that come from the inference model. The inference output is a goodness score for every token in the BPE encoding dictionary; the softmax layer normalizes those logits into probabilities that together add up to 1.0, or 100%.

A top_p value of 1.0 allows everything through to the next stage, biasing by temperature. A value of 0.5 keeps 50% of the probability: the most certain tokens are added until up to 0.5 of the cumulative probability is reached and no more, a cutoff that in most cases leaves just a few tokens of choice if the AI is pretty certain what to write.

In your case, where you told the AI the binary choice of what to produce, the uncertainty beyond the specification comes from instruction-following and how to write the output. You’ll likely get top values like “true”: 20%, “false”: 19%, “True”: 4%, “Sure”: 2%… and the listing continues for 200,000 tokens, the top 20 of which you can observe with logprobs.

  • Set top_p: 0.1 in that case, and only “true” can be output.
  • Set top_p: 0.4, and the first three are considered randomly according to their weights, giving a discrete probability distribution.
  • Set top_p: 0.0000001, and it becomes mathematically impossible for a second-rank token to appear.

So with a setting of 1, you are basically turning off any function of this API parameter, which is also the default behavior when it is not specified.

Set it to (effectively) 0, and you get the otherwise-indeterminate model’s best production path for a run, because sampling is turned off.
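
If it helps to see the cutoff mechanically, here is a small self-contained Python sketch (the logit values are invented purely for illustration; the real model scores its entire vocabulary server-side). Softmax turns the logits into probabilities, and the nucleus filter keeps top-ranked tokens until their cumulative probability reaches top_p:

import math
import random

# Invented logits for a few candidate tokens (the real model scores
# every token in its ~200k-token vocabulary)
logits = {"true": 4.0, "false": 3.95, "True": 2.4, "Sure": 1.7, "yes": 1.2}

def softmax(scores):
    # Normalize logits into probabilities that sum to 1.0
    exps = {token: math.exp(value) for token, value in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

def nucleus(probs, top_p):
    # Keep the highest-probability tokens until the cumulative mass reaches top_p
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = softmax(logits)
for top_p in (1.0, 0.5, 0.1, 1e-7):
    print(f"top_p={top_p}: candidates = {list(nucleus(probs, top_p))}")

# Temperature and sampling then pick randomly among the survivors,
# weighted by their probabilities
survivors = nucleus(probs, 0.5)
print("sampled:", random.choices(list(survivors), weights=list(survivors.values()))[0])

With top_p near 1 every token survives the filter; as top_p shrinks toward 0, only the single top-ranked token can ever be emitted, which is the “turn off sampling” effect described above.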

three passes of rewriting through different models - at top_p 0

Understanding Nucleus Sampling (Top-p) in AI Language Models

Nucleus sampling, commonly referred to as top-p sampling, is a method used in AI language models to manage the diversity of generated text. The “p” in top-p denotes the cumulative probability threshold that determines which subset of tokens the model considers when generating the next word in a sequence.

In language models, each token is assigned a probability through a softmax layer, which converts the raw output scores (logits) from the model into a probability distribution that sums to 1.0, or 100%. The top-p parameter establishes a cutoff in this cumulative probability distribution. For example:

  • A top-p value of 1.0 includes all tokens, allowing the model to consider the entire probability distribution for the next word. This setting is equivalent to applying no cutoff and is typically chosen to maximize diversity in the generated text.

  • A top-p value of 0.5 means that the model considers only the most probable tokens until their cumulative probability reaches 50%. This usually results in a more focused selection of tokens, particularly when the model has high confidence in its predictions.

To illustrate, imagine a model tasked with generating a binary response. The probabilities for potential tokens might be: “yes” (20%), “no” (19%), “maybe” (4%), “possibly” (2%), etc. If top-p is set to 0.1, only the token “yes” would be selected, since its 20% probability on its own already covers the 10% cumulative threshold.

  • Setting top-p to 0.4 would include the first three tokens (“yes”, “no”, and “maybe”), allowing the model to randomly select among them based on their probabilities.

  • A very low top-p value, such as 0.0000001, would restrict the model to the single most probable token, effectively eliminating diversity in the output.

When top-p is set to 1, the parameter is effectively neutralized, permitting the model to utilize the full probability distribution. Conversely, setting top-p to 0 would theoretically disable sampling, compelling the model to always select the most probable token, which can result in repetitive and predictable outputs.

Adjusting the top-p parameter enables users to finely tune the balance between creativity and precision in AI-generated text, making it possible to customize the output to meet specific needs and contexts. This flexibility is crucial for applications requiring a particular style or level of inventiveness in text generation.
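
As a rough sketch of what that means in practice (again assuming the openai Python SDK; the model, prompt, and trial count are placeholders), you can repeat the same request with top_p = 1 and with a tiny top_p and count how many distinct answers come back. The tiny value should collapse the output to the single most probable token on essentially every run:

from collections import Counter
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Is water wet? Answer Yes or No."}]  # placeholder prompt

def run_trials(top_p, n=10):
    # Send the identical request n times and tally the answers
    answers = Counter()
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0,
            top_p=top_p,
            max_tokens=1,
        )
        answers[response.choices[0].message.content] += 1
    return answers

print("top_p = 1:    ", run_trials(1.0))
print("top_p = 1e-9: ", run_trials(1e-9))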


Thanks for your helpful answers!
