Possible bug? Nondeterministic logprobs with echo=True, max_tokens=0

Hi, thanks for the great API!

I’m trying to compute the probability of a given string using the recommended trick of calling .create(logprobs=0, max_tokens=0, echo=True, ...), but I’m seeing nondeterministic results.

Can the OpenAI devs confirm that some unbiased stochastic estimation is happening under the hood, and that this is not a bug or API misuse?

Expected behavior

I’d expect the logprobs to be deterministic.

To reproduce

import openai

# echo=True with max_tokens=0 returns the prompt itself, annotated with
# per-token logprobs, without generating any new tokens.
logp = openai.Completion.create(
    model="text-davinci-003",
    prompt="Is this a bug?",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,  # just to be safe
)["choices"][0]["logprobs"]["token_logprobs"]
print(logp)

which gives me different values each time I call it, e.g.

[None, -4.274783, -1.3748269, -3.629216, -0.9771994]
[None, -4.2751837, -1.374485, -3.6267009, -0.97600585]
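
(For context, I then sum those token logprobs, skipping the leading None, to get the string’s log-probability, so the jitter above carries straight into the final number. Rough sketch of that step:)

import math

token_logprobs = logp[1:]            # the first entry is None (the first token has no conditional logprob)
total_logprob = sum(token_logprobs)  # sum of the conditional token logprobs
print(total_logprob, math.exp(total_logprob))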

System information

Python 3.10.9
openai==0.27.4
tiktoken==0.3.3



I just noticed the same thing. I initially thought logprobs were deterministic, because with n>1 the repeated completions within a single call come back identical. However, with multiple copies of the same prompt in one request, they aren’t deterministic, so you can get contradictory results from a single API call:

import openai
from pprint import pprint

# Two copies of the same prompt, n=2 completions each: all four sets of
# logprobs should match if scoring were deterministic.
completion = openai.Completion.create(
    engine='text-davinci-003',
    prompt=['Today is'] * 2,
    max_tokens=1,
    n=2,
    logprobs=2,
)

for c in completion.choices:
    pprint(c['logprobs']['top_logprobs'])

printed

[{' a': -1.8103094,
  ' the': -2.1339836}]
[{' a': -1.8103094,
  ' the': -2.1339836}]
[{' a': -1.8117326,
  ' the': -2.1332269}]
[{' a': -1.8117326,
  ' the': -2.1332269}]

What’s more, some of the older models behave differently: ada appears to be consistently deterministic, whereas babbage appears to be deterministic both for n>1 and for multiple copies of the same prompt within an API call, but it is not deterministic across API calls.
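
Something along these lines is enough to check the within-call behavior on the older engines (rough sketch, reusing the imports above):

# Within a single call: two copies of the prompt, n=2 completions each.
for engine in ['ada', 'babbage']:
    completion = openai.Completion.create(
        engine=engine,
        prompt=['Today is'] * 2,
        max_tokens=1,
        n=2,
        logprobs=2,
    )
    print(engine)
    for c in completion.choices:
        pprint(c['logprobs']['top_logprobs'])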

I did some quick testing, iterating over API calls and collecting values, and it looks like some API model names (babbage, maybe davinci) might be backed by a fixed set of slightly different backend models that they consistently reuse, with a mapping under the hood between user requests and backend models.

import openai
from collections import defaultdict
from pprint import pprint

# Collect the top-2 logprobs for the same single-copy prompt over repeated API calls.
d = defaultdict(list)

for _ in range(10):
    completion = openai.Completion.create(
        engine='babbage',
        prompt=['My favorite food is'] * 1,
        max_tokens=1,
        n=1,
        logprobs=2,
    )

    for c in completion.choices:
        for k, v in c['logprobs']['top_logprobs'][0].items():
            d[k].append(v)

pprint(dict(d))

printed

{' a': [-3.6145434,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538],
 ' pizza': [-3.2944465,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687]}

If this behavior isn’t a bug, it should at least be documented somewhere. It should also affect other API behavior, right? If different backend models have different logprobs, they will generate different distributions of output text.
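
One way to poke at this would be to tally sampled first tokens over a bunch of calls and see whether the empirical frequencies drift (just a sketch; the logprob gaps above are tiny, so in practice you’d need far more samples than this):

from collections import Counter

counts = Counter()
for _ in range(20):
    completion = openai.Completion.create(
        engine='babbage',
        prompt='My favorite food is',
        max_tokens=1,
        n=5,
        temperature=1,
    )
    counts.update(c['text'] for c in completion.choices)

print(counts)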
