Possible bug? Nondeterministic logprobs with echo=True, max_tokens=0

Hi, thanks for the great API!

I’m trying to compute the probability of a given string using the recommended trick of calling .create(logprobs=0, max_tokens=0, echo=True, ...), but I’m seeing nondeterministic results.

Can the OpenAI devs confirm that some unbiased stochastic estimation is happening under the hood, and that this is not a bug or API misuse?

Expected behavior

I’d expect the logprobs to be deterministic.

To reproduce

import openai

# echo=True with max_tokens=0 returns the prompt itself, annotated with
# per-token logprobs, without generating any new tokens.
logp = openai.Completion.create(
    model="text-davinci-003",
    prompt="Is this a bug?",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,  # just to be safe
)["choices"][0]["logprobs"]["token_logprobs"]
print(logp)

which gives me different values each time I call it, e.g.

[None, -4.274783, -1.3748269, -3.629216, -0.9771994]
[None, -4.2751837, -1.374485, -3.6267009, -0.97600585]
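
(For context, I then sum those token logprobs, skipping the leading None, to get the string’s log-probability, so the jitter above carries straight into the final number. Rough sketch of that step:)

import math

token_logprobs = logp[1:]            # the first entry is None (the first token has no conditional logprob)
total_logprob = sum(token_logprobs)  # sum of the conditional token logprobs
print(total_logprob, math.exp(total_logprob))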

System information

Python 3.10.9
openai==0.27.4
tiktoken==0.3.3



I just noticed the same thing. I initially thought logprobs were deterministic, because with n>1 the repeated completions within a single call come back identical. However, with multiple copies of the same prompt in one request, they aren’t deterministic, so you can get contradictory results from a single API call:

import openai
from pprint import pprint

# Two copies of the same prompt, n=2 completions each: all four sets of
# logprobs should match if scoring were deterministic.
completion = openai.Completion.create(
    engine='text-davinci-003',
    prompt=['Today is'] * 2,
    max_tokens=1,
    n=2,
    logprobs=2,
)

for c in completion.choices:
    pprint(c['logprobs']['top_logprobs'])

printed

[{' a': -1.8103094,
  ' the': -2.1339836}]
[{' a': -1.8103094,
  ' the': -2.1339836}]
[{' a': -1.8117326,
  ' the': -2.1332269}]
[{' a': -1.8117326,
  ' the': -2.1332269}]

What’s more, some of the older models behave differently: ada appears to be consistently deterministic, whereas babbage appears to be deterministic both for n>1 and for multiple copies of the same prompt within an API call, but it is not deterministic across API calls.
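
Something along these lines is enough to check the within-call behavior on the older engines (rough sketch, reusing the imports above):

# Within a single call: two copies of the prompt, n=2 completions each.
for engine in ['ada', 'babbage']:
    completion = openai.Completion.create(
        engine=engine,
        prompt=['Today is'] * 2,
        max_tokens=1,
        n=2,
        logprobs=2,
    )
    print(engine)
    for c in completion.choices:
        pprint(c['logprobs']['top_logprobs'])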

I did some quick testing, iterating over API calls and collecting values, and it looks like some API model names (babbage, maybe davinci) might be backed by a fixed set of slightly different backend models that they consistently reuse, with a mapping under the hood between user requests and backend models.

import openai
from collections import defaultdict
from pprint import pprint

# Collect the top-2 logprobs for the same single-copy prompt over repeated API calls.
d = defaultdict(list)

for _ in range(10):
    completion = openai.Completion.create(
        engine='babbage',
        prompt=['My favorite food is'] * 1,
        max_tokens=1,
        n=1,
        logprobs=2,
    )

    for c in completion.choices:
        for k, v in c['logprobs']['top_logprobs'][0].items():
            d[k].append(v)

pprint(dict(d))

printed

{' a': [-3.6145434,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538,
        -3.613538],
 ' pizza': [-3.2944465,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687,
            -3.299687]}

If this behavior isn’t a bug, it should at least be documented somewhere. It should also affect other API behavior, right? If different backend models have different logprobs, they will generate different distributions of output text.
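
One way to poke at this would be to tally sampled first tokens over a bunch of calls and see whether the empirical frequencies drift (just a sketch; the logprob gaps above are tiny, so in practice you’d need far more samples than this):

from collections import Counter

counts = Counter()
for _ in range(20):
    completion = openai.Completion.create(
        engine='babbage',
        prompt='My favorite food is',
        max_tokens=1,
        n=5,
        temperature=1,
    )
    counts.update(c['text'] for c in completion.choices)

print(counts)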
