Predicting full sentence autoregressively

One could directly predict the full sentence in a single step. Imagine the output layer as a vector of activations where each activation corresponds to a word in the vocabulary (much as it is now). But instead of the activations being probabilities of the next word, an activation would simply be 1 if that word is present in the sentence and 0 if not. As a next step, I could have another transformer or neural net whose only job is to arrange these words into the right order so that the sentence makes sense. I think this approach would speed up generation considerably, since all the words are produced at once instead of one per forward pass. I also suppose that this approach is not covered by existing non-autoregressive models, since it would still keep the autoregressive part.
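
To make the first stage concrete, here is a minimal PyTorch sketch of what such a word-presence head might look like. Everything here is my own assumption for illustration (the vocabulary size, hidden width, class and variable names, and the 0.5 threshold), not an existing implementation: the idea is just to swap the usual softmax over next-word probabilities for independent sigmoids trained against a multi-hot target.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # hypothetical vocabulary size
HIDDEN_DIM = 512      # hypothetical width of the encoder's context vector

class BagOfWordsHead(nn.Module):
    """Predicts in one shot which vocabulary words appear in the sentence.

    Instead of a softmax over next-word probabilities, each output unit is
    an independent sigmoid: ~1 if the word occurs in the sentence, ~0 if not.
    """
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, hidden_dim) -> per-word presence logits
        return self.proj(context)

# Training step sketch: multi-hot targets instead of next-token targets.
head = BagOfWordsHead(HIDDEN_DIM, VOCAB_SIZE)
context = torch.randn(8, HIDDEN_DIM)      # stand-in for an encoder output
targets = torch.zeros(8, VOCAB_SIZE)      # multi-hot: 1 where a word occurs
targets[:, [42, 7, 1001]] = 1.0           # toy sentence made of 3 word ids

loss = nn.BCEWithLogitsLoss()(head(context), targets)
loss.backward()

# At inference, threshold the sigmoids to recover the unordered word set;
# nonzero() yields (batch_index, word_id) pairs to hand to the ordering model.
word_set = (torch.sigmoid(head(context)) > 0.5).nonzero()
```

The second stage, the ordering model, is left abstract here; it could be a small transformer that takes the unordered word set as input and emits a permutation, but that is a design choice rather than anything this sketch commits to.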