Why does GPT-4 generate poor Java code when it clearly knows how to write better code?

Sometimes the source code generated by GPT has code-quality problems: very long methods, a few nested loops, copy-pasted fragments (from a human point of view), hardcoded constants, and so on. It is easy to ask it to refactor, though. Once I decided to ask GPT-4 what it thinks about the code it had generated, in a very broad sense: “Is the generated code good or bad?” The answer really surprised me: the model's level of self-criticism let it identify the issues correctly and offer a new, refactored version without any hints or advice from me!
So why is that knowledge not used “by default” during generation?
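
To make the kind of issues I mean concrete, here is a small made-up illustration. It is not actual GPT-4 output; the class `OrderTotals`, its methods, and the tax and discount values are all hypothetical. `totalBefore` shows the copy-pasted loop and hardcoded numbers, and `totalAfter` is roughly the kind of refactoring I would expect after asking for a review.

```java
import java.util.List;

// Hypothetical example: names and rates are invented for illustration only.
public class OrderTotals {

    // "First draft" style: the summing loop is copy-pasted in each branch,
    // and the tax and discount percentages are hardcoded inline.
    static long totalBefore(List<Long> pricesInCents, String customerType) {
        long total = 0;
        if (customerType.equals("regular")) {
            for (long p : pricesInCents) {
                total += p;
            }
            total = total + total * 20 / 100;
        } else if (customerType.equals("vip")) {
            for (long p : pricesInCents) {
                total += p;
            }
            total = total * 90 / 100;
            total = total + total * 20 / 100;
        }
        return total;
    }

    // Refactored style: named constants, one shared helper, no duplicated loop.
    private static final long TAX_PERCENT = 20;
    private static final long VIP_DISCOUNT_PERCENT = 10;

    static long sum(List<Long> pricesInCents) {
        return pricesInCents.stream().mapToLong(Long::longValue).sum();
    }

    static long totalAfter(List<Long> pricesInCents, boolean vip) {
        long subtotal = sum(pricesInCents);
        long discounted = vip ? subtotal * (100 - VIP_DISCOUNT_PERCENT) / 100 : subtotal;
        return discounted * (100 + TAX_PERCENT) / 100;
    }

    public static void main(String[] args) {
        List<Long> prices = List.of(1000L, 2000L);
        System.out.println(totalBefore(prices, "vip"));
        System.out.println(totalAfter(prices, true));
    }
}
```

Both versions compute the same totals for the "regular" and "vip" cases; the second one is simply easier to read and change, which is the kind of improvement GPT-4 proposed on its own when asked to judge its code.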