Thought/answer pattern while evaluating confidence from logprobs


The big picture is that I’m trying to get the best of both worlds, and I wonder if that’s possible.
The context is the example from “using_logprobs”, an article in the OpenAI Cookbook (I’m not allowed to insert the link here), where the LLM is asked, through a fairly simple prompt, to classify news headlines into one of four categories: Technology, Politics, Sports, or Arts.

First, I want to have an estimate of GPT’s confidence in its response. To that end, the prompt says
Return only the name of the category, and nothing else.
and we can inspect the logprob for each category in the LLM’s response. For example, for the headline “Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut”, the probabilities are:
Art, logprobs: -0.009169078, linear probability: 99.09%
Sports, logprobs: -4.696669, linear probability: 0.91%
and so on. All is well: we get an idea of the LLM’s confidence for all four categories.
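
For reference, here is a minimal sketch of how the "linear probability" figures follow from the logprobs, assuming they are natural logarithms (which the numbers above are consistent with). The function name `logprob_to_percent` and the `top_logprobs` dictionary are just my own illustration, not something from the cookbook.

```python
import math


def logprob_to_percent(logprob: float) -> float:
    """Convert a natural-log probability into a percentage."""
    return math.exp(logprob) * 100


# Values copied from the example output above.
top_logprobs = {"Art": -0.009169078, "Sports": -4.696669}

for token, logprob in top_logprobs.items():
    percent = logprob_to_percent(logprob)
    print(f"{token}, logprobs: {logprob}, linear probability: {percent:.2f}%")
```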

Now to the second thing I would like: prompting the LLM to respond only with the category is not optimal in terms of response quality. I’m thinking specifically of the chain-of-thought pattern, which is easy to set up by modifying the prompt. The problem is that it then becomes difficult to get meaningful probabilities: by the time the LLM states its answer, it will already have made up its mind because of the reasoning it just laid out. So I expect a very high probability for the predicted category, which would not reflect the LLM’s actual uncertainty.
One could also calculate the probability of the entire response, including the thought leading to the predicted category. But in that case we don’t know the probability of the thoughts (nor the thoughts themselves) that would lead to the other categories, so I don’t see how we can get a fair category-to-category comparison like the one we had above with the “no-thought” answer.
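
To make the "probability of the entire response" idea concrete, here is a rough sketch, assuming the per-token logprobs of the generated text are available as a plain list. The joint log-probability of the sequence is the sum of the token logprobs. The helper name `sequence_logprob` and the numbers are made-up placeholders, not real model output.

```python
import math


def sequence_logprob(token_logprobs: list[float]) -> float:
    """Joint log-probability of a sequence: the sum of its token logprobs."""
    return sum(token_logprobs)


# Made-up placeholder logprobs for a short "thought + answer" response.
example_token_logprobs = [-0.1, -0.2, -0.3]

total = sequence_logprob(example_token_logprobs)
print(f"sequence logprob: {total:.4f}")
print(f"sequence probability: {math.exp(total):.4f}")
```

This only scores the one thought-plus-answer path that was actually sampled, which is exactly the limitation described above: it says nothing about the paths that would have ended in the other categories.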

My question is:
Is there any way to do things differently so I get both 1) the high-quality response of the chain of thought AND 2) a good estimate of the confidence of all categories?

Thanks a lot!