I just noticed the same thing. I initially thought logprobs were deterministic, because when n>1 they are deterministic. However, with multiple copies of the same prompt, they aren’t deterministic, so you can generate contradictory results within a single API call:
import openai
from pprint import pprint

completion = openai.Completion.create(
    engine='text-davinci-003',
    prompt=['Today is'] * 2,
    max_tokens=1,
    n=2,
    logprobs=2,
)

for c in completion.choices:
    pprint(c['logprobs']['top_logprobs'])
printed:
[{' a': -1.8103094,
' the': -2.1339836}]
[{' a': -1.8103094,
' the': -2.1339836}]
[{' a': -1.8117326,
' the': -2.1332269}]
[{' a': -1.8117326,
' the': -2.1332269}]
What’s more, some of the older models have different behavior patterns. ada appears to be consistently deterministic, whereas babbage appears to be deterministic for both n>1 and multiple copies of the same prompt within an API call, but babbage is not deterministic across API calls.
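As a quick way to check the within-call part of that claim for any engine (a minimal sketch; the engine list and prompt here are just examples I picked, not anything special), you can send several copies of the same prompt in one request and see whether every choice comes back with identical top_logprobs:

import openai

def within_call_deterministic(engine, prompt='Today is', copies=4):
    # True if every duplicate prompt in a single request returns identical top_logprobs.
    completion = openai.Completion.create(
        engine=engine,
        prompt=[prompt] * copies,
        max_tokens=1,
        logprobs=2,
    )
    tops = [tuple(sorted(c['logprobs']['top_logprobs'][0].items()))
            for c in completion.choices]
    return len(set(tops)) == 1

for engine in ['ada', 'babbage', 'text-davinci-003']:
    print(engine, within_call_deterministic(engine))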
I did some quick testing iterating over API calls and collecting values, and it looks like some API model names (babbage, maybe davinci) might have a fixed set of different backend models that they consistently use, and there’s a mapping under the hood between user requests and backend models.
from collections import defaultdict

d = defaultdict(list)
for _ in range(10):
    completion = openai.Completion.create(
        engine='babbage',
        prompt=['My favorite food is'] * 1,
        max_tokens=1,
        n=1,
        logprobs=2,
    )
    for c in completion.choices:
        for k, v in c['logprobs']['top_logprobs'][0].items():
            d[k].append(v)
pprint(dict(d))
printed:
{' a': [-3.6145434,
-3.613538,
-3.613538,
-3.613538,
-3.613538,
-3.613538,
-3.613538,
-3.613538,
-3.613538,
-3.613538],
' pizza': [-3.2944465,
-3.299687,
-3.299687,
-3.299687,
-3.299687,
-3.299687,
-3.299687,
-3.299687,
-3.299687,
-3.299687]}
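To make the fixed-set-of-backends hypothesis a bit more concrete (a sketch along the same lines; 10 calls is obviously a small sample), you can record the whole top_logprobs dict from each call as a hashable fingerprint and count how often each distinct one shows up:

import openai
from collections import Counter
from pprint import pprint

fingerprints = Counter()
for _ in range(10):
    completion = openai.Completion.create(
        engine='babbage',
        prompt='My favorite food is',
        max_tokens=1,
        logprobs=2,
    )
    top = completion.choices[0]['logprobs']['top_logprobs'][0]
    # Freeze the dict into a hashable key so identical results collapse together.
    fingerprints[tuple(sorted(top.items()))] += 1

# Each distinct key would correspond to a distinct backend serving the 'babbage' name.
pprint(dict(fingerprints))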
If this behavior isn’t a bug, it should at least be documented somewhere. It should also affect other API behavior, right? If different backend models have different logprobs, they will generate different distributions of output text.
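As a rough illustration using the two text-davinci-003 fingerprints from the first example (the gap there happens to be tiny, but the same reasoning applies to any larger divergence), exponentiating the logprobs gives the implied sampling probabilities, and they differ between the two backends:

import math

# Top logprob of ' a' after 'Today is', from the two backend variants observed above.
logprob_backend_1 = -1.8103094
logprob_backend_2 = -1.8117326

p1 = math.exp(logprob_backend_1)  # ~0.16360
p2 = math.exp(logprob_backend_2)  # ~0.16337
print(f"P(' a') backend 1: {p1:.5f}")
print(f"P(' a') backend 2: {p2:.5f}")
print(f"relative difference: {abs(p1 - p2) / p1:.3%}")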