How do the o1-series models differ from plain transformers?
A plain transformer can already generate a series of thoughts, separating them with, say, “###”. What “forces” o1 to do something different from what a plain transformer would do? In particular, how is it forced to start the next thought, rather than simply keep extending the previous one, as a plain GPT would? A minimal sketch of what I mean is given below.
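
For concreteness, here is a rough sketch of what I mean by a “plain transformer” emitting delimiter-separated thoughts. This uses the Hugging Face transformers API; the model name (gpt2) and the “###” delimiter are just placeholders, not anything o1-specific:

```python
# Rough sketch: plain autoregressive decoding with an informal "###"
# thought delimiter. Model name and delimiter are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Problem: 12 * 17 = ?\nThought 1:"
inputs = tokenizer(prompt, return_tensors="pt")

# Plain next-token sampling: nothing here forces the model to "close"
# the current thought and open a new one. It just keeps appending
# tokens, and a "###" separator only appears if the model happens to
# sample it.
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

So my question is what, beyond this kind of decoding loop, makes o1 produce distinct successive thoughts rather than one run-on continuation.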