Difference in token log probabilities when `echo` is `True` vs `False`

I’m seeing a difference in the log probabilities for the same tokens in the response when `echo` is set to `True` vs `False`.

Notice in the example below that the log probability for the token ` John` is -4.1053495 when `echo=True` and -4.104846 when `echo=False`.

Is this expected behavior? If so, why would we expect different log probabilities for the individual tokens for exactly the same prompt?

When `echo` is `True`:

Request:

{'engine': 'davinci', 'prompt': 'Hello, my name is', 'temperature': 0, 'n': 1, 'max_tokens': 10, 'best_of': 1, 'logprobs': 1, 'stop': None, 'top_p': 1, 'presence_penalty': 0, 'frequency_penalty': 0, 'echo': True}

Response:

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": {
        "text_offset": [
          0,
          5,
          6,
          9,
          14,
          17,
          22,
          23,
          25,
          28,
          30,
          41,
          51,
          52,
          54
        ],
        "token_logprobs": [
          null,
          -1.7002722,
          -3.0985389,
          -0.33346418,
          -0.05650585,
          -4.1053495,
          -1.9224068,
          -0.6926424,
          -1.4471763,
          -1.0555193,
          -3.2214253,
          -1.8482708,
          -1.0643281,
          -1.137746,
          -1.5790797
        ],
        "tokens": [
          "Hello",
          ",",
          " my",
          " name",
          " is",
          " John",
          ".",
          " I",
          " am",
          " a",
          " recovering",
          " alcoholic",
          ".",
          " I",
          " have"
        ],
        "top_logprobs": [
          null,
          {
            ",": -1.7002722
          },
          {
            " I": -2.4386003
          },
          {
            " name": -0.33346418
          },
          {
            " is": -0.05650585
          },
          {
            " John": -4.1053495
          },
          {
            ".": -1.9224068
          },
          {
            " I": -0.6926424
          },
          {
            " am": -1.4471763
          },
          {
            " a": -1.0555193
          },
          {
            " recovering": -3.2214253
          },
          {
            " alcoholic": -1.8482708
          },
          {
            ".": -1.0643281
          },
          {
            " I": -1.137746
          },
          {
            " have": -1.5790797
          }
        ]
      },
      "text": "Hello, my name is John. I am a recovering alcoholic. I have"
    }
  ],
  "created": 1642114600,
  "id": "cmpl-4Q3JIcJJLL8nVB0aJwc95qkkIO41x",
  "model": "davinci:2020-05-03",
  "object": "text_completion",
  "request_time": 1.2324142456054688
}

When `echo` is `False`:

Request:

{'engine': 'davinci', 'prompt': 'Hello, my name is', 'temperature': 0, 'n': 1, 'max_tokens': 10, 'best_of': 1, 'logprobs': 1, 'stop': None, 'top_p': 1, 'presence_penalty': 0, 'frequency_penalty': 0, 'echo': False}

Response:

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": {
        "text_offset": [
          17,
          22,
          23,
          25,
          28,
          30,
          41,
          51,
          52,
          54
        ],
        "token_logprobs": [
          -4.104846,
          -1.9145488,
          -0.69057566,
          -1.4444675,
          -1.0576655,
          -3.235984,
          -1.862587,
          -1.0654857,
          -1.1328539,
          -1.5772507
        ],
        "tokens": [
          " John",
          ".",
          " I",
          " am",
          " a",
          " recovering",
          " alcoholic",
          ".",
          " I",
          " have"
        ],
        "top_logprobs": [
          {
            " John": -4.104846
          },
          {
            ".": -1.9145488
          },
          {
            " I": -0.69057566
          },
          {
            " am": -1.4444675
          },
          {
            " a": -1.0576655
          },
          {
            " recovering": -3.235984
          },
          {
            " alcoholic": -1.862587
          },
          {
            ".": -1.0654857
          },
          {
            " I": -1.1328539
          },
          {
            " have": -1.5772507
          }
        ]
      },
      "text": " John. I am a recovering alcoholic. I have"
    }
  ],
  "created": 1642114681,
  "id": "cmpl-4Q3KbiQV1Jy1Vj7ikyRysoqYwjVPG",
  "model": "davinci:2020-05-03",
  "object": "text_completion",
  "request_time": 0.9291911125183105
}
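
For reference, here is a minimal sketch of how I’m making the two calls (using the openai Python client’s Completion endpoint; the only difference between the two requests is the `echo` flag):

```python
# Reproduce the comparison: identical requests except for echo, then pull out
# the logprob assigned to the first completion token (" John").
import openai

openai.api_key = "sk-..."  # your API key

common = dict(
    engine="davinci",
    prompt="Hello, my name is",
    temperature=0,
    n=1,
    max_tokens=10,
    best_of=1,
    logprobs=1,
    stop=None,
    top_p=1,
    presence_penalty=0,
    frequency_penalty=0,
)

with_echo = openai.Completion.create(echo=True, **common)
without_echo = openai.Completion.create(echo=False, **common)

# With echo=True the 5 prompt tokens come first, so " John" is at index 5;
# with echo=False it is the first completion token.
print(with_echo["choices"][0]["logprobs"]["token_logprobs"][5])     # e.g. -4.1053495
print(without_echo["choices"][0]["logprobs"]["token_logprobs"][0])  # e.g. -4.104846
```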

@tonyhlee my guess is that with `echo=True` the prompt is returned in addition to the completion, and the conditional log probability changes as a result. This is speculation, though, since the log prob of the completion tokens should always be conditioned on the prompt either way.
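
To make the conditioning point concrete (a small illustration, with values copied from the `echo=False` response above): each entry in `token_logprobs` is log P(token | prompt, earlier completion tokens), so the completion’s total log probability is just their sum.

```python
# Values copied from the echo=False response above; summing the per-token
# conditional logprobs gives log P(completion | prompt).
completion_logprobs = [
    -4.104846, -1.9145488, -0.69057566, -1.4444675, -1.0576655,
    -3.235984, -1.862587, -1.0654857, -1.1328539, -1.5772507,
]
total = sum(completion_logprobs)
print(total)  # log probability of " John. I am a recovering alcoholic. I have"
```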

Hi @tonyhlee :wave:,

The most plausible explanation seems to be in the docs:

  • The API is stochastic by default which means that you might get a slightly different completion every time you call it, even if your prompt stays the same. You can control this behavior with the temperature setting.

This would mean that the probability of each token changes slightly on every call, hence the difference in `token_logprobs`.


I see. Thank you for confirming! So it’s not a problem with `echo=True` vs. `False` specifically. Do you know why we get slightly different log probabilities even when the temperature is set to 0 and the prompt is the same?

Hi @sps :wave:,

Thanks for the reply. In my example, I set the temperature to 0, so I would expect the same tokens and log probabilities.

@m-a.schenk good experiment, that rules out my original hypothesis.
@sps ‘stochastic’ in this context refers to the sampling behavior of the decoder when it outputs tokens (GPT-3 uses nucleus sampling). This can be ruled out when you set the temperature to 0, as @m-a.schenk did.
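
As a rough illustration only (a toy sampler, not OpenAI’s actual decoder): the logits get divided by the temperature before the softmax, so as the temperature goes to 0 the distribution collapses onto the argmax and the sampling randomness disappears entirely.

```python
import numpy as np

def nucleus_sample(logits, temperature=1.0, top_p=1.0, rng=None):
    """Toy top-p (nucleus) sampler; temperature=0 degenerates to greedy argmax."""
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))              # no randomness left
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())                    # softmax of temperature-scaled logits
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # most likely token first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]                          # smallest set with mass >= top_p
    return int(rng.choice(kept, p=probs[kept] / probs[kept].sum()))
```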

@tonyhlee my best guess now is that there is some stochasticity during inference itself. Potential sources of stochasticity in GPT-3 are dropout, layer normalization, numerical errors, and parallelization.

Dropout is definitely used during training of GPT-3, though at inference time it is normally switched off and replaced by a fixed scaling multiplier. If dropout were still applied during inference, it would be the most likely source of noise.

Layer normalization can also add randomness if the normalization statistics end up depending on the other inputs batched with yours during inference. For base models, your request is usually batched with other requests on OpenAI’s backend, so this could introduce noise. For fine-tuned models serving only one request at a time, though, this would not be a likely culprit.

Numerical errors exist in all machine learning models, and when combined with the nondeterministic ordering of parallel operations on a GPU and across GPUs, they can accumulate into small differences like the ones you are seeing.
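
A tiny illustration of the numerical-error point (plain NumPy, nothing specific to GPT-3): floating-point addition is not associative, so summing the same values in a different accumulation order, which is exactly what different parallelization or batching schemes do, gives slightly different results.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s_chunked = np.float32(0.0)
for chunk in np.array_split(x, 8):   # one accumulation order
    s_chunked += chunk.sum()

s_pairwise = x.sum()                 # NumPy's pairwise summation order

print(s_chunked, s_pairwise, s_chunked == s_pairwise)
# The two sums usually differ in the last few bits even though the inputs are identical.
```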
