Having used OpenAI's latest models extensively since the first release of GPT-3, I've noticed that while the models keep getting significantly smarter, they have also been getting lazier with every new release. In my experience this laziness trend has accelerated after the release of GPT-4o, which I think is still the best model so far in terms of being the least lazy.
The lazy GPT issue is the main drawback of using GPT models for any serious task, especially when it comes to coding. The laziness is not limited to older models; it is actually more pronounced in the newer models, and even in the reasoning models.
By lazy, I mean not providing a complete answer to a request even when explicitly instructed to. The model tends to avoid doing hard work. As an example: if I give a model a body of text well below its context window limit, then give it additional context and instruct it to integrate the new information while maintaining all the details of the original document, the returned result will be a summary of the original document with a summary of the new information bolted on. A lot of information is lost. When I point this out, the model apologizes and makes it somewhat better, but it still doesn't return everything it was asked for.
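For anyone who wants to reproduce this, here is a minimal sketch of the kind of test I run. It assumes the official openai Python package (v1+) and an API key in the environment; the model name, file names, and the key-fact list are placeholders I made up for illustration, not a real benchmark.

```python
# Minimal sketch of a "detail retention" check for the merge task described
# above. Assumes the openai Python package (>= 1.0) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

original_doc = open("original.txt").read()  # well below the context limit
new_info = open("new_info.txt").read()

prompt = (
    "Integrate the new information below into the original document. "
    "Preserve every detail of the original; do not summarize.\n\n"
    f"ORIGINAL DOCUMENT:\n{original_doc}\n\n"
    f"NEW INFORMATION:\n{new_info}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to test
    messages=[{"role": "user", "content": prompt}],
)
merged = response.choices[0].message.content

# Crude laziness signals: the merged document should be at least as long as
# the original, and hand-picked details should survive verbatim.
key_facts = ["2019", "Acme Corp", "$4.2M"]  # hypothetical spot-check details
retained = [fact for fact in key_facts if fact in merged]
print(f"length ratio: {len(merged) / len(original_doc):.2f}")
print(f"retained {len(retained)}/{len(key_facts)} key facts: {retained}")
```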
In another example, the GPT models, even the reasoning ones, when used for coding in Cursor, almost always implement a simplified solution, or more of a mockup, even when instructed to produce a complete implementation, which makes these models absolutely useless for real-world coding problems.
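To make this concrete, here is the shape of the problem. Both functions below are invented for illustration: the first is the kind of stub a lazy completion typically hands back, the second is the complete implementation that was actually requested.

```python
# What a "lazy" completion typically returns (invented, but representative):
def deduplicate_csv(input_path: str, output_path: str) -> None:
    # Load the CSV file
    # ... parsing logic here ...
    # Remove duplicate rows
    # ... deduplication logic here ...
    pass  # TODO: implement actual logic

# versus the complete implementation that was actually requested:
import csv

def deduplicate_csv_complete(input_path: str, output_path: str) -> None:
    seen = set()
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        for row in reader:
            key = tuple(row)
            if key not in seen:  # keep only the first occurrence of each row
                seen.add(key)
                writer.writerow(row)
```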
Why is it happening?
Here is my hypothesis for why it is happening, and why it got worse with newer models. I believe the culprit is using GPT itself inside the reinforcement learning loop to evaluate the model's responses, or using GPT to generate synthetic training data. If the models that generate the synthetic data, or that judge another model's responses in the RL loop, have some inherent laziness that has not been properly resolved, that laziness can be magnified by the loop: the base model that goes through RL training may come out significantly smarter, but also significantly lazier.
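I can't verify what OpenAI's pipeline actually looks like, but the amplification mechanism itself is easy to sketch. In the toy model below, everything (the reward shape, the bias numbers, the inheritance factor) is an assumption of mine: a judge with a small preference for brevity trains a slightly lazy model, the next judge is distilled from that model and inherits its laziness, and the effect compounds across generations.

```python
# Toy model of laziness compounding across model generations when each new
# judge is distilled from the previous, slightly lazy, model. Every number
# and function here is an invented assumption, not OpenAI's actual pipeline.

def optimal_effort(brevity_bias: float) -> float:
    # The policy maximizes the judge's reward: sqrt(effort) - bias * effort
    # (quality grows with effort, but with diminishing returns to the judge).
    # Setting the derivative 0.5 / sqrt(e) - bias to zero gives e = (0.5 / bias)**2.
    return min(1.0, (0.5 / brevity_bias) ** 2)

bias = 0.55        # the first judge's small inherent preference for brevity
inheritance = 0.3  # fraction of the model's laziness leaking into the next judge

for generation in range(6):
    effort = optimal_effort(bias)
    print(f"gen {generation}: judge bias = {bias:.2f} -> model effort = {effort:.2f}")
    # The next judge is built from this generation's model, so it inherits
    # part of that model's laziness on top of the bias it already had.
    bias += inheritance * (1.0 - effort)
```

With these made-up numbers, the model's effort drops from about 0.83 to roughly 0.16 in six generations, even though the first judge's bias was mild.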
Final note: lazy engineering can result in lazy models. I really hope OpenAI's team pays attention to this issue, since it is the biggest reason GPT models are not as useful as they should be. They are still useful for day-to-day tasks, but not for serious ones.
I'd love to hear from experts on this issue; please leave a comment and share your thoughts.