Difference between "tokens" and "text" in response

Hello, I’m seeing a discrepancy between the list of tokens I get back and the text in the response.

This is the request:

{model='openai/davinci', prompt='Answer the following question about geography.\n\nQuestion: What is the longest river?\nAnswer: Nile ##\n\nQuestion: What is the tallest mountain?\nAnswer:', temperature=0, num_completions=1, top_k_per_token=5, max_tokens=100, stop_sequences=['##'], echo_prompt=False, top_p=1, presence_penalty=0, frequency_penalty=0}
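
For reference, my understanding is that this request ends up hitting OpenAI's legacy Completions endpoint (the response below has "object": "text_completion"), so a roughly equivalent raw call would look something like the sketch below. The parameter mapping in the comments is my assumption based on how the field names line up:

    # Sketch only; requires the legacy openai (<1.0) Python client.
    import openai

    prompt = (
        "Answer the following question about geography.\n\n"
        "Question: What is the longest river?\nAnswer: Nile ##\n\n"
        "Question: What is the tallest mountain?\nAnswer:"
    )

    response = openai.Completion.create(
        model="davinci",
        prompt=prompt,
        temperature=0,
        n=1,               # num_completions=1
        logprobs=5,        # top_k_per_token=5
        max_tokens=100,
        stop=["##"],       # stop_sequences=['##']
        echo=False,        # echo_prompt=False
        top_p=1,
        presence_penalty=0,
        frequency_penalty=0,
    )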

This is the response I get back:

JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": {
        "text_offset": [
          149,
          155,
          163,
          164,
          164,
          164,
          164,
          164
        ],
        "token_logprobs": [
          -0.8430533,
          -0.05527358,
          -0.31007066,
          -0.07606475,
          -0.028943757,
          -0.2905347,
          -0.0031962388,
          -0.3996574
        ],
        "tokens": [
          " Mount",
          " Everest",
          " ##",
          "\n",
          "\n",
          "Question",
          ":",
          " What"
        ],
        "top_logprobs": [
          {
            " Chim": -3.9527752,
            " Everest": -1.1450402,
            " Kil": -2.322431,
            " Mount": -0.8430533,
            " Mt": -3.039987
          },
          {
            " El": -6.1829395,
            " Everest": -0.05527358,
            " Fuji": -5.5334053,
            " Kil": -4.095801,
            " Olympus": -5.4415402
          },
          {
            "\n": -2.0230033,
            "\n\n": -3.850575,
            " ": -4.441365,
            " ##": -0.31007066,
            " (": -3.6939626
          },
          {
            "\n": -0.07606475,
            "\n\n": -2.8826785,
            " ": -5.974212,
            " (": -5.575713,
            ".": -6.6401477
          },
          {
            "\n": -0.028943757,
            "<|endoftext|>": -4.7994246,
            "In": -7.8236175,
            "Question": -5.2021422,
            "The": -6.255388
          },
          {
            "Answer": -4.8792033,
            "In": -4.966905,
            "Question": -0.2905347,
            "The": -3.3670382,
            "This": -4.8384757
          },
          {
            " 1": -8.625377,
            " :": -6.9509153,
            " What": -8.843604,
            ".": -8.176841,
            ":": -0.0031962388
          },
          {
            " How": -3.2076557,
            " What": -0.3996574,
            " Where": -2.5891602,
            " Which": -2.8821936,
            " Who": -2.605403
          }
        ]
      },
      "text": " Mount Everest "
    }
  ],
  "created": 1641589200,
  "id": "cmpl-4Nqd6NOJkTI5hUBYxLgMTdeTMWDZT",
  "model": "davinci:2020-05-03",
  "object": "text_completion",
  "request_time": 1.4495737552642822
}

The text field in the response is correct (“Mount Everest”), but the tokens list looks wrong: it contains tokens past the stop sequence ##:

    "tokens": [
      " Mount",
      " Everest",
      " ##",
      "\n",
      "\n",
      "Question",
      ":",
      " What"
    ],

I would expect to just get back [" Mount", " Everest"].
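
To make the mismatch concrete, here is a quick check against the response above (assuming it has been parsed into a dict named response, which is just a name for illustration):

    choice = response["choices"][0]

    # The stop sequence is correctly kept out of the text field...
    assert "##" not in choice["text"]

    # ...but the logprobs token list keeps going past the stop sequence.
    assert any("##" in tok for tok in choice["logprobs"]["tokens"])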

Thank you for the response. Do you mind pointing me to the documentation that states this?

I’m still a bit confused because, according to this documentation, “the Stop Sequence is an optional setting that tells the API when to stop generating tokens. The completion will not contain the stop sequence and you can pass up to four stop sequences”.
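
In the meantime, this is the rough workaround I'm using to trim the logprobs arrays so they line up with the truncated text. The helper name is mine, and it assumes the stop sequence shows up inside a single token (true for this example, but maybe not in general):

    def truncate_at_stop(logprobs, stop_sequences):
        """Trim the parallel logprobs arrays at the first token that
        contains a stop sequence (workaround sketch, not official API)."""
        tokens = logprobs["tokens"]
        cut = len(tokens)
        for i, tok in enumerate(tokens):
            if any(stop in tok for stop in stop_sequences):
                cut = i
                break
        return {
            "tokens": tokens[:cut],
            "token_logprobs": logprobs["token_logprobs"][:cut],
            "top_logprobs": logprobs["top_logprobs"][:cut],
            "text_offset": logprobs["text_offset"][:cut],
        }

    trimmed = truncate_at_stop(response["choices"][0]["logprobs"], ["##"])
    print(trimmed["tokens"])  # [' Mount', ' Everest'] for the response above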