If you were to actually examine the logits (the logprobs after softmax), you would have seen on GPT-3 models that the results were always the same. Something about the optimization of OpenAI models since then, or the hardware they run on, produces a small variance in the output values between runs, a fraction of a percentage point when examining the top probabilities.
Perhaps what you are wondering, though, is why you get significantly different responses each time.
That is due to token sampling.
The result of language model inference is a certainty score assigned to every token in the model's token dictionary (the token encoder's vocabulary). One could simply pick the top result for every token that is generated. However, it was discovered that such output doesn't actually sound very natural or human.
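As a rough illustration, that "always pick the top result" approach (greedy decoding) would look something like this, with invented logit values over a made-up four-token vocabulary:

```python
import numpy as np

# Invented raw scores (logits) over a tiny hypothetical vocabulary.
vocab = ["The", " cat", " pizza", "zzarella"]
logits = np.array([4.1, 2.3, 0.7, -1.5])

# Greedy decoding: always take the single highest-scoring token.
greedy_token = vocab[int(np.argmax(logits))]
print(greedy_token)  # -> "The", every single time
```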
Instead, the total scores are combined into a normalized probability distribution, where the sum of all certainties = 1.0, or 100%. Imagine a roulette wheel where the slot for "The" is wide because it is well predicted to start a generation, while the token "zzarella" is a poor way to start a sentence and gets an infinitesimal sliver of chance.
Thus, in any single trial, tokens appear with a frequency directly related to the model's predicted likelihood at that position.
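Here is a small sketch of that roulette wheel, again with invented logits: softmax turns the raw scores into probabilities that sum to 1.0, and the sampler draws a token in proportion to its slot width.

```python
import numpy as np

rng = np.random.default_rng()

# Invented logits for the same tiny four-token vocabulary.
vocab = ["The", " cat", " pizza", "zzarella"]
logits = np.array([4.1, 2.3, 0.7, -1.5])

# Softmax: exponentiate and normalize so the certainties sum to 1.0.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Spin the roulette wheel: each token's slot is as wide as its probability.
choice = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(4))), "->", choice)
```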
This direct correlation of certainty to probability can be altered with the sampling parameters top_p and temperature.
Top-p is applied first. When it is set below 1.0, the least probable tokens in the tail of the distribution are eliminated; a value of 0.9 would keep only the tokens that together occupy the top 90% of probability mass.
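A sketch of what such a top-p (nucleus) filter could look like; this is illustrative only, not OpenAI's actual implementation:

```python
import numpy as np

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize what is left."""
    order = np.argsort(probs)[::-1]               # indices, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.50, 0.30, 0.15, 0.05])
print(top_p_filter(probs, 0.9))   # the 0.05 tail token is zeroed out
```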
Temperature is then applied as a reweighting: reducing the value shifts more of the mass onto the most likely tokens, while a high value makes the probabilities more equal.
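One common way to implement temperature is to rescale the logits before the softmax; here is a sketch (with made-up logits) showing how a low value concentrates the mass on the top token and a high value flattens the distribution:

```python
import numpy as np

def apply_temperature(logits, temperature=1.0):
    """Rescale logits before softmax. Lower values sharpen the distribution
    toward the top token; higher values flatten it toward equal probabilities."""
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    return probs / probs.sum()

logits = [4.1, 2.3, 0.7, -1.5]
print(apply_temperature(logits, 0.2).round(3))  # nearly all mass on the top token
print(apply_temperature(logits, 2.0).round(3))  # much flatter, more varied output
```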
In this manner, you get an AI that sounds creative rather than robotic, and with your own tuning of the sampling parameters, you can suppress some of the unlikely choices.
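Putting it together, you can set both parameters on a request. This sketch assumes the current openai Python package (v1+) with an API key in the environment, and the model name is only a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder; use whatever chat model you have access to
    messages=[{"role": "user", "content": "Write a one-line slogan for a pizzeria."}],
    temperature=0.2,       # lower: closer to deterministic, "robotic" output
    top_p=0.9,             # trim the unlikely tail of tokens
)
print(response.choices[0].message.content)
```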