Evaluate confidence on GPT-5 response

Recently, I’ve been working on a project using the OpenAI API with the GPT-4 model, where I used the logprobs option to calculate perplexity and estimate the model’s confidence in its responses. I noticed that for a specific task, the GPT-5 model performed better; however, the new model does not return logprobs. Is there any other way to estimate the model’s confidence in its responses without using logprobs?
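
For context, my current approach looks roughly like this. A minimal sketch, assuming the Chat Completions API with `logprobs=True`; the prompt is just a placeholder:

```python
import math

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Name the capital of Australia."}],
    logprobs=True,
)

# Collect the per-token log probabilities of the sampled output.
token_logprobs = [t.logprob for t in completion.choices[0].logprobs.content]

# Perplexity = exp of the negative mean token log probability;
# lower values mean the model was more "confident" in its own wording.
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(f"perplexity: {perplexity:.3f}")
```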

A replacement for logprobs, and for sampling controls such as top_p? Not really, mathematically. The best you can do is further language-level analysis, and whether that actually succeeds is speculative.

  1. Capture the reasoning summaries: have an AI grader, with a substantial prompt and few-shot examples of its own, judge how much waffling, uncertainty, and deliberation shows up in that summarized text (a sketch of such a grader follows this list).
  2. Trials: run the same input several times and look for consistency, or departures, in the delivered outputs, with a similar analysis of whether the reasoning summaries or internal choices also diverge (see the consistency sketch after this list).
  3. gpt-5-pro (available only on the Responses API) uses “parallel test-time compute”: essentially it generates several candidate runs and picks among them by something like sequence perplexity, with an unseen and uncontrollable amount of extra generation that you simply have to trust is worth the 12x cost.
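
Here is a hedged sketch of the grader idea from item 1. It assumes you have already captured a reasoning summary as plain text; the grading rubric, prompt wording, and model choice are illustrative placeholders, not a recommended configuration:

```python
import re

from openai import OpenAI

client = OpenAI()

GRADER_INSTRUCTIONS = (
    "You grade reasoning summaries for uncertainty. "
    "Count hedging, self-correction, waffling between alternatives, and "
    "unresolved deliberation. Reply with a single integer from 0 (fully "
    "confident) to 10 (highly uncertain), and nothing else."
)

def grade_uncertainty(reasoning_summary: str) -> int | None:
    """Second model call that scores how much uncertainty the summary shows."""
    resp = client.responses.create(
        model="gpt-5",
        instructions=GRADER_INSTRUCTIONS,
        input=reasoning_summary,
    )
    match = re.search(r"\d+", resp.output_text)
    return int(match.group()) if match else None

score = grade_uncertainty(
    "The model first considered 1968, then reconsidered and settled on 1969, "
    "noting it was fairly sure but wanted to double-check the month."
)
print(f"uncertainty score (0-10): {score}")
```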
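And a sketch of the trials idea from item 2: repeat the same request, normalize the answers, and treat agreement across runs as a rough stand-in for confidence. This assumes the Responses API and a short, directly comparable answer; the prompt, trial count, and normalization are placeholders for your own task:

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()

PROMPT = "In what year did Apollo 11 land on the Moon? Answer with the year only."
N_TRIALS = 5

def ask_once(prompt: str) -> str:
    """One independent trial; output_text is the model's concatenated text output."""
    resp = client.responses.create(model="gpt-5", input=prompt)
    return resp.output_text.strip().lower()

answers = [ask_once(PROMPT) for _ in range(N_TRIALS)]
counts = Counter(answers)
top_answer, top_count = counts.most_common(1)[0]

# Agreement rate across trials: 1.0 means every run gave the same normalized answer.
agreement = top_count / N_TRIALS
print(f"answers: {answers}")
print(f"most common: {top_answer!r}, agreement: {agreement:.2f}")
```

For open-ended outputs, exact-match agreement is too blunt; you would swap in embedding similarity or another grader call to compare the runs.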

Note that gpt-5 itself makes for a poor “judge”. It seems to have knowledge-based intelligence but lack holistic understanding, and every token position appears to be sampled quite randomly between runs.
