Alright, I tried the actual API, and it does return the list of logprobs for each token in both the input and output.
Thus, you could sum those logprobs to get the log probability of this particular sequence being chosen, and use that as some form of “confidence.” But in general I wouldn’t put too much stock in that value, because it’s poorly behaved: probabilities multiply, so longer outputs always end up with a lower overall probability, and if the model picks even one low-probability token (which could be something unimportant, like “and” instead of “the”) that single token drags the whole product down.
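To make that concrete, here’s a minimal sketch of the arithmetic. The function name and the sample logprob values are made up for illustration; the input is just a list of per-token log probabilities like the ones the API hands back. Summing logprobs and exponentiating gives the joint probability (the product of the per-token probabilities), and dividing the sum by the token count first gives a length-normalized score, which is the usual partial fix for the “longer outputs always score lower” problem:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> dict[str, float]:
    """Turn per-token log probabilities into sequence-level scores.

    (Illustrative helper, not part of any API client.)
    """
    total_logprob = sum(token_logprobs)
    n = len(token_logprobs)
    return {
        # exp(sum of logprobs) == product of per-token probabilities
        "joint_probability": math.exp(total_logprob),
        # geometric mean of per-token probabilities: length-normalized,
        # so it doesn't automatically shrink as outputs get longer
        "per_token_geomean": math.exp(total_logprob / n),
    }

# A short, confident sequence vs. the same thing with one "weak" token:
confident = [-0.05, -0.02, -0.10]        # all near-certain tokens
one_weak = [-0.05, -0.02, -4.6, -0.10]   # one ~1% token mixed in

print(sequence_confidence(confident))  # joint ≈ 0.84
print(sequence_confidence(one_weak))   # joint ≈ 0.008 -- one token tanks it
```

Even the length-normalized version only tells you how “surprising” the token sequence was to the model, which is not the same thing as how likely the answer is to be correct.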
Anyway: if what you want is “confidence in the overall answer,” you can’t really construct that from the per-token probabilities of individual word-fragment tokens.
This actually gives a pretty neat insight into why I think these models aren’t really “thinking,” too: they just predict one token after the next, with no “overall” model of what they’re doing.