Logprobs are flaky with gpt-4o. Sometimes the logprob for the top token comes back as -9999.0, and the top_logprobs do not align with the token that was actually generated. I tested the same prompt with gpt-4 and gpt-4o-mini and both behave correctly, so this appears to be a problem specific to gpt-4o.
See below:
```
ipdb> response.choices[0].logprobs.content[20]
ChatCompletionTokenLogprob(token='no', bytes=[110, 111], logprob=-9999.0, top_logprobs=[TopLogprob(token='iveness', bytes=[105, 118, 101, 110, 101, 115, 115], logprob=0.0), TopLogprob(token='iven', bytes=[105, 118, 101, 110], logprob=-19.625)])
ipdb> response.choices[0].logprobs.content[28]
ChatCompletionTokenLogprob(token='yes', bytes=[121, 101, 115], logprob=-9999.0, top_logprobs=[TopLogprob(token=' Utility', bytes=[32, 85, 116, 105, 108, 105, 116, 121], logprob=0.0), TopLogprob(token='Utility', bytes=[85, 116, 105, 108, 105, 116, 121], logprob=-20.625)])
```
I'm trying to use LLMs to "evaluate" some conversations by answering a few yes/no questions. I want to use logprobs to obtain a probability score.
To Reproduce
- Write a prompt that asks a set of questions with yes/no answers (I am not at liberty to share the one I used, as it contains company IP)
- Execute the llm_annotate_conversation function below (a minimal sketch of the relevant API call follows this list)
- Inspect the logprobs at the position of yes/no tokens
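I can't share llm_annotate_conversation itself, but a minimal sketch of the kind of call it makes, assuming the standard openai Python client (the messages here are placeholders), looks like this:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder prompt; the real one asks several yes/no questions about a conversation.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer each question strictly with 'yes' or 'no'."},
        {"role": "user", "content": "<conversation transcript + yes/no questions>"},
    ],
    logprobs=True,
    top_logprobs=2,
)
```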
Useful helper:

```python
yesno_idxs = [
    i
    for i, tokenlogprob in enumerate(response.choices[0].logprobs.content)
    if tokenlogprob.token.strip().lower() in ['yes', 'no']
]
```
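For context on why the -9999.0 values matter, here is a sketch of how one would turn these logprobs into probability scores; math.exp(-9999.0) underflows to 0.0, so every affected answer scores 0 regardless of the model's actual confidence:

```python
import math

# Convert the logprob of each yes/no token into a probability.
# math.exp(-9999.0) underflows to 0.0, so any token hit by the bug
# comes out with probability 0 no matter how confident the model was.
for i in yesno_idxs:
    tok = response.choices[0].logprobs.content[i]
    print(f"token={tok.token!r} logprob={tok.logprob} prob={math.exp(tok.logprob)}")
```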
I think this is a problem with the API itself rather than the Python SDK.