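There's inherent non-determinism in GPU floating-point calculations: parallel operations can complete in a different order from run to run, and floating-point addition is not associative, so the same sum can round slightly differently. The resulting differences in log probabilities are tiny, but when the top two candidate tokens are nearly tied, a different token may occasionally be chosen, and everything generated after that point can differ.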
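To make the effect concrete, here is a minimal CPU-only sketch in plain Python. It is a toy illustration, not GPU code: the function names, the three terms, and the rival score of 0.6 are all hypothetical values chosen so that the rounding difference is visible. It sums the same numbers in two orders, then shows how that last-bit difference can flip which of two nearly tied tokens wins.

```python
def sum_in_order(terms):
    """Accumulate left to right, the way a sequential reduction would."""
    total = 0.0
    for t in terms:
        total += t
    return total


def pick_token(score_a, score_b):
    """Greedy choice between two candidates: A wins only if strictly higher."""
    return "A" if score_a > score_b else "B"


# Hypothetical partial contributions to token A's score.
terms = [0.1, 0.2, 0.3]

# Same numbers, different accumulation order (as can happen when
# parallel threads finish in a different order between runs).
score_forward = sum_in_order(terms)         # (0.1 + 0.2) + 0.3
score_reversed = sum_in_order(terms[::-1])  # (0.3 + 0.2) + 0.1

# Hypothetical score of the competing token B, nearly tied with A.
rival_score = 0.6

print(score_forward == score_reversed)          # the two sums differ in the last bit
print(pick_token(score_forward, rival_score))   # choice under one ordering
print(pick_token(score_reversed, rival_score))  # choice under the other ordering
```

On a real GPU the orderings are decided by the hardware scheduler and the kernels in use rather than by an explicit reversal, but the mechanism is the same: a rounding difference in the last bit is enough to change the winner when two scores are this close.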