o1-preview outperforms o1, o3, and DeepSeek at role-playing heterogeneous personas

I’m running experiments that prompt LLMs to act as Homo Silicus. An interesting observation is that o1-preview performs better than o1 and o3 when role-playing heterogeneous personas. I’m disappointed to learn that o1-preview will be deprecated on July 28, 2025. In my experiments, I found that o1, o3, and DeepSeek, which are trained as reasoning models, are good at solving math problems but fail to role-play heterogeneous personas who may not be good at math. More details about my experiments are here: [Chen, C., Karaduman, O. and Kuang, X., 2025. Behavioral Generative Agents for Energy Operations. arXiv preprint arXiv:2506.12664.]
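For anyone who wants to reproduce this kind of test, here is a minimal sketch of persona-style prompting via the OpenAI chat completions API. It is not the prompt from the paper; the persona text, question, and use of a user-role message (the o1-series models did not accept a system role) are my own assumptions.

```python
# Minimal sketch (not the authors' actual setup): asking a model to
# role-play a heterogeneous persona who is deliberately weak at math.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative persona; the point is that a faithful role-player should
# answer imprecisely, while a reasoning model tends to just compute.
persona = (
    "You are Dana, a 58-year-old retail worker who finds arithmetic "
    "tedious and usually estimates rather than computes. Stay in "
    "character and answer the way Dana plausibly would, not the way "
    "an expert would."
)

response = client.chat.completions.create(
    model="o1-preview",  # the model the posts report as best for personas
    messages=[
        # o1-series models rejected the "system" role, so the persona
        # is folded into the user message here.
        {
            "role": "user",
            "content": persona + "\n\nQuestion: roughly what is 17% of 243?",
        }
    ],
)
print(response.choices[0].message.content)
```

Comparing the same prompt across o1-preview, o1, o3, and DeepSeek makes the gap easy to see: the reasoning-tuned models tend to break character and return the exact answer.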

I have the same problem/observation. I have an email reply agent that handles some of my support requests. I started it with o1-preview, because it was the only reasoning model at the time.
Then I tried every new model after it, but none is as natural or writes such authentic emails as o1-preview. All the other models exaggerate, write whole novels in response to simple questions, or launch into big lists, but they just don’t write human-like messages. o1-preview was really unique in that way, and none of the models released since have improved in that specific area.
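For context, a hypothetical sketch of the kind of agent I mean is below. The prompt wording, function name, and model default are my assumptions, not my production setup; the interesting part is having to explicitly forbid the list-heavy, long-winded style the newer models drift into.

```python
# Hypothetical email reply agent: draft a short, human-sounding support
# reply and explicitly rule out bullet lists and essay-length answers.
from openai import OpenAI

client = OpenAI()


def draft_reply(customer_email: str, model: str = "o1-preview") -> str:
    prompt = (
        "Draft a brief, natural-sounding reply to the support email below. "
        "Write like a person: no bullet lists, no headings, at most four "
        "sentences.\n\n" + customer_email
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(draft_reply("Hi, my invoice from March seems to be missing. Can you resend it?"))
```

Even with constraints like these in the prompt, the newer models tend to pad their replies, which is exactly the behavior I never saw from o1-preview.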
