I’ll answer some of the more intellectually stimulating questions here - you get to do the hard work.
The outputs can only be mostly reproducible. The model itself has variations between identical runs.
The AI model produces a probability for every possible next token, and a sampler then randomly picks among them, weighted by how likely they are. If the possible first token of an answer is 80% certain to be “cat” and 15% certain to be “dog”, then at default API sampling parameters, about 15% of trials will have the AI answering “dog”.
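A minimal sketch of that weighted pick, using Python's `random` module on a made-up distribution (the token probabilities here are invented for illustration):

```python
import random

# Invented next-token distribution for illustration
probs = {"cat": 0.80, "dog": 0.15, "fish": 0.05}

# random.choices draws proportionally to the weights, much like a
# temperature-1 sampler drawing from the model's token probabilities
token = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(token)  # "cat" most of the time, "dog" in roughly 15% of runs
```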
`top_p` is the parameter that best gets you an AI that only answers with the charted path. At 0.1 for 10%, the response is generated only from the smallest set of tokens whose probabilities add up to the top 10% of the distribution – with “cat” at 80%, the AI could only write “cat”. That’s likely what you want - the best answer.
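To make the cutoff concrete, here is a rough sketch of the nucleus (top_p) filtering step - an illustration of the idea, not the API's actual implementation:

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches top_p, then renormalize."""
    kept, total = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = p
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept.items()}

# With top_p=0.1, "cat" at 80% already covers the 10% mass by itself
print(top_p_filter({"cat": 0.80, "dog": 0.15, "fish": 0.05}, 0.1))
# {'cat': 1.0}
```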
`seed` alone would give you “dog” almost every time if that was what a particular seed happened to select - the reproducible part there being the reuse of the same randomness, not the selection of the best answer.
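That distinction can be sketched with a seeded random pick (again on an invented distribution - the API's `seed` parameter controls its internal sampler, not Python's):

```python
import random

probs = {"cat": 0.80, "dog": 0.15, "fish": 0.05}

def sample_with_seed(seed: int) -> str:
    # The same seed replays the same random draw, so the pick repeats -
    # even when it happens to land on a lower-probability token like "dog"
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(sample_with_seed(42) == sample_with_seed(42))  # True: reproducible
```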
These will be verging on hallucination though. If you ask the AI how certain it is, you’re going to get scores related to the training and prompting.
Consider if my system message told the AI one of these things about itself:
- “The AI is an expert with super-human logic and reasoning skills”
- “ChatPro only gives the correct answer”
- “The AI language generation can make mistakes, so double-check what you wrote”
Such other input context can make the token probabilities look “more confident” when you ask the AI to evaluate itself - it does not reflect the underlying mechanisms (which the AI can’t observe).
The best way to see the inner workings of certainty is by employing logprobs. In a complex answer, you would have to navigate within the formatted answer and find the relevant tokens.
An example where this technique might be used: “rate this book review from 1-10, from 1: extremely negative, to 10: extremely positive, in a JSON (format)”. You can get the top-20 logprobs at the answer position, extract all the number tokens from that, and then take the median of the probability mass or a probability-weighted average. Then you’ll likely need to renormalize the answer range so the most positive and most negative reviews can still span the full 1-10.
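A sketch of that extraction step, on made-up logprob values (in the real API you would read them from `response.choices[0].logprobs.content[...].top_logprobs`, and note that “10” may span two tokens):

```python
import math

# Invented top_logprobs at the rating token's position, for illustration
top_logprobs = {"7": -0.4, "8": -1.5, "6": -2.5, " ": -4.0, "9": -4.5}

# Keep only tokens that are ratings 1-10; convert logprob -> probability
scores = {int(t): math.exp(lp) for t, lp in top_logprobs.items()
          if t.strip().isdigit() and 1 <= int(t) <= 10}

mass = sum(scores.values())                            # mass captured by rating tokens
rating = sum(r * p for r, p in scores.items()) / mass  # probability-weighted rating
print(round(rating, 2))
```

From here you could stretch the observed minimum and maximum back out to the full 1-10 range, as described above.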
That should give insight into your other questions also. The actual task you would want to keep relatively simple and well-instructed, ensuring the AI knows what to produce and what input it is answering about while it is producing the output.