Perhaps a better way to say this, using the MCMC analogy, is that, when generating a token, all previous tokens in the chain have been sampled, that is, the new token is being generated given instance values of all previous output tokens as well as the original input
The model doesn’t “try to generate” some sequence of tokens, then run a hypothesis checker on that trial, and then roll back that sequence if the checker says “not good enough.” The search is single-token. It’s not even equivalent to a multi-step planner or a minimax optimizer, much less a full theorem prover. And neither planners, minimax optimizers, nor theorem provers are, in the end, “reasoning” (but I argue they are closer in many regards.)
If you describe a new, chess-like, game, to this model, and then ask it to play the game, it won’t be able to do well, and frequently doesn’t even follow the rules. This is despite the fact that simple 64 kilobyte 8-bit MCUs are able to play a reasonable (though not impressive) game of chess. And chess is an open-knowledge game, it doesn’t even add the complication of estimating hidden state of the world.
It can fake it pretty convincingly in simple cases, though