Flaky logprobs with gpt-4o

Logprobs are flaky with gpt-4o. Sometimes the logprob for the top token comes back as -9999.0, and the top_logprobs do not align with what is actually generated. I tested the same prompt with gpt-4 and gpt-4o-mini and both work fine, so this seems to be a problem specific to gpt-4o.

See below:

ipdb> response.choices[0].logprobs.content[20]
ChatCompletionTokenLogprob(token='no', bytes=[110, 111], logprob=-9999.0, top_logprobs=[TopLogprob(token='iveness', bytes=[105, 118, 101, 110, 101, 115, 115], logprob=0.0), TopLogprob(token='iven', bytes=[105, 118, 101, 110], logprob=-19.625)])
ipdb> response.choices[0].logprobs.content[28]
ChatCompletionTokenLogprob(token='yes', bytes=[121, 101, 115], logprob=-9999.0, top_logprobs=[TopLogprob(token=' Utility', bytes=[32, 85, 116, 105, 108, 105, 116, 121], logprob=0.0), TopLogprob(token='Utility', bytes=[85, 116, 105, 108, 105, 116, 121], logprob=-20.625)])

I’m trying to use LLMs to ā€˜evaluate’ some conversations by having the model answer a few yes/no questions, and I want to use the logprobs to obtain a probability score for each answer.

To Reproduce

  1. Write a prompt that asks a number of questions with yes/no answers (I am not at liberty to share the one I used as it contains company IP)
  2. Execute the llm_annotate_conversation function below
  3. Inspect the logprobs at the position of yes/no tokens
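
The actual llm_annotate_conversation function and prompt aren't reproduced here; as a rough sketch only, a call of the same shape using the standard chat.completions parameters (logprobs=True plus top_logprobs) and a placeholder system prompt would look something like this:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_annotate_conversation(conversation: str, model: str = "gpt-4o"):
    """Hypothetical reconstruction of the call, not the original function."""
    return client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer each of the questions below with only 'yes' or 'no'."},
            {"role": "user", "content": conversation},
        ],
        logprobs=True,   # return the logprob of each generated token
        top_logprobs=2,  # plus the 2 most likely alternatives at each position
    )

response = llm_annotate_conversation("...conversation and yes/no questions here...")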

useful helper:

# indexes of the positions where the generated token is 'yes' or 'no'
yesno_idxs = [
    i
    for i, tokenlogprob in enumerate(response.choices[0].logprobs.content)
    if tokenlogprob.token.strip().lower() in ['yes', 'no']
]
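
For reference (my own illustration, not code from the report): exponentiating the logprob at each of those positions is how the yes/no probability score would normally be obtained, and the -9999.0 sentinel breaks this because math.exp(-9999.0) underflows to 0.0.

import math

# Illustrative sketch: convert each yes/no token's logprob into a probability.
# With the sentinel value, math.exp(-9999.0) underflows to 0.0, so the score is useless.
for i in yesno_idxs:
    tok = response.choices[0].logprobs.content[i]
    print(f"{tok.token}: p = {math.exp(tok.logprob):.4f}")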

I think this is a problem with the API itself rather than with the Python SDK.

I’ve been facing the same issue since last week. The logprobs are completely off. I reached out to OpenAI support and their reply was basically ā€œtry to write clearer instructionsā€ :sweat_smile:

This behaviour has now been added to the documentation:

The log probability of this token, if it is within the top 20 most likely tokens. Otherwise, the value -9999.0 is used to signify that the token is very unlikely.

From that description, one would think it is deliberate: obfuscating further architectural discoveries, masking any sampling from below the returned top 20, or simply hiding how the token was produced.

The problem is that it also occurs when the token is contained within the top 20, or when its probability is clearly near 100%.
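
A small diagnostic along these lines (my own sketch, assuming any response object with logprobs enabled) makes the inconsistency visible: it lists the positions where the sampled token carries the -9999.0 sentinel and whether that token even appears among the returned top_logprobs.

SENTINEL = -9999.0

# Sketch: flag positions where the sampled token got the sentinel logprob,
# and check whether that token appears in the returned top_logprobs at all.
for i, tok in enumerate(response.choices[0].logprobs.content):
    if tok.logprob <= SENTINEL:
        in_top = any(alt.token == tok.token for alt in tok.top_logprobs)
        print(f"position {i}: token={tok.token!r} has sentinel logprob; "
              f"listed in top_logprobs: {in_top}")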
