Nothing special is needed, just a regular sentence. For example, you can try the following input prompt:
“How did WW2 start?”
With this configuration, I get two different answers across runs.
Yes, what's happening before the generation of token probabilities is an unreliable calculation. The first token of a story might come out as "The" = 33.55% on one run and "The" = 33.21% on another. With the probabilities bouncing around like that across successive generations, even with greedy sampling the second-ranked token "A" = 33.33% (+/- x%) can jump into first place and get selected instead.
This is exactly what we see in the one 3.5 model we get logprobs from; the same symptom shows up in the rest.
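To make the mechanism concrete, here is a minimal sketch (not the actual inference stack) of how tiny run-to-run numeric jitter, modeled here as Gaussian noise on the logits, can flip the greedy argmax between near-tied tokens. The token names, logit gaps, and noise scale are all illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Three near-tied candidates for the first token, mimicking the
# ~33.3% figures above (names and gaps are made up).
tokens = ["The", "A", "It"]
logits = np.array([0.0002, 0.0001, 0.0])

rng = np.random.default_rng(0)
for run in range(5):
    # Stand-in for run-to-run numeric jitter (e.g. non-deterministic
    # GPU reduction order), modeled as tiny Gaussian noise.
    noisy = logits + rng.normal(scale=1e-4, size=logits.shape)
    probs = softmax(noisy)
    print(f"run {run}: greedy pick = {tokens[int(np.argmax(probs))]}, "
          f"probs = {np.round(probs * 100, 2)}%")
```

Even with greedy (deterministic) selection, the pick changes from run to run whenever the noise is comparable to the gap between the top candidates.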
The setup: four variants using seed/top_p and a 50/200 token limit, with 100 completions for each variation.
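For reference, here is a sketch of how such a run might look with the OpenAI Python SDK. This is my reading of the setup, not the original configuration: the model name and the exact seed/top_p values are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = [{"role": "user", "content": "How did WW2 start?"}]

# Assumed four variants: {seed, low top_p} x {50, 200 token limit}.
variants = [
    {"seed": 42, "max_tokens": 50},
    {"seed": 42, "max_tokens": 200},
    {"top_p": 0.01, "max_tokens": 50},
    {"top_p": 0.01, "max_tokens": 200},
]

results = {}
for params in variants:
    results[str(params)] = [
        client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed; substitute the model under test
            messages=PROMPT,
            **params,
        ).choices[0].message.content
        for _ in range(100)  # 100 completions per variant
    ]
```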
More variations of the experiment (both in terms of models and input data) are needed before drawing a definitive conclusion.
From the tests done so far, the seed parameter gives more stable results than a low top_p, and the longer the completion, the higher the variability. Interestingly, adding a very low top_p increased variability compared to not setting it at all.
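For completeness, one way to put a number on "variability" for a variant's 100 completions is the share of runs that deviate from the most common output. This particular metric is my own illustrative choice, not necessarily the one used above:

```python
from collections import Counter

def variability(completions):
    # Fraction of runs that deviate from the most frequent completion;
    # 0.0 means every run produced identical text.
    modal = Counter(completions).most_common(1)[0][1]
    return 1.0 - modal / len(completions)

# Toy stand-in for one variant's 100 completions.
runs = ["The war began with...", "The war began with...", "A war began..."]
print(f"{len(set(runs))} distinct, variability={variability(runs):.2f}")
```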