Fine-tuned GPT-4.1 / GPT-4o breaking context, looping, or clinging to training patterns

Hi everyone,

I’ve been fine-tuning GPT-4.1 and GPT-4o using SFT. My goal isn’t to add domain knowledge — I just want the model to follow a specific style and persona. The dataset is small (around 60 training samples + 10 validation), all written in Korean, and the model is also supposed to respond in Korean.
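Roughly, each example in my JSONL looks like the sketch below (the persona wording and messages are English placeholders for the forum; the real data is all Korean):

```python
import json

# One training example, paraphrased in English (my real data is Korean).
# The persona wording here is a placeholder, not my actual system message.
example = {
    "messages": [
        {
            "role": "system",
            "content": "You are <persona>; always answer in Korean in a friendly, informal style.",
        },
        {"role": "user", "content": "What did you do today?"},
        {"role": "assistant", "content": "An in-style answer written in the persona's voice."},
    ]
}

# Each example becomes one line of the JSONL training file.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```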

The strange part is that the fine-tuned models behave in ways the base models never do:

  • drifting out of context,

  • giving unrelated or nonsensical answers,

  • repeating phrases or falling into loops (even when I tell it not to),

  • sticking to the previous topic after I change subjects,

  • obsessively mimicking patterns from the training data.

I expected some overfitting with a small dataset, but this feels more like the model is becoming unstable rather than just overfitting — especially since I’m only trying to adjust style/persona, not teach new knowledge.

Before I scale up the dataset, I wanted to ask: is this normal when fine-tuning GPT-4.x with small SFT datasets, especially with non-English data? Or does it sound like something else is going wrong? And is there any known workaround to reduce the looping or context issues?

Any advice or similar experiences would be really appreciated. Thanks.


It sounds like you are still trying to “chat” with the AI in multi-turn interactions.

  • Are you fine-tuning on complete conversations of the length you are extending the chat to?
  • Are you using a distinct system message in the fine-tuning examples, and then reproducing it exactly at chat inference (see the sketch below)? Ideally a system message good enough that it could already get the persona job done on its own.

If you are doing neither, there’s your “aha” moment. If you’ve significantly modified the model weights, but only by rewarding completion sequences inside short training contexts (say, 300 tokens), the AI is in unknown territory once a conversation grows beyond that length. Position matters.
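On the system-message point, a minimal sketch of what I mean (the persona text and model ID are placeholders): the string used in every training example should be the exact string you send at inference.

```python
from openai import OpenAI

client = OpenAI()

# The exact system message used in every fine-tuning example.
PERSONA_SYSTEM_MESSAGE = "You are <persona>; always answer in Korean in a friendly, informal style."

# At inference, reuse that string verbatim with the fine-tuned model.
response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org::abc123",  # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": PERSONA_SYSTEM_MESSAGE},
        {"role": "user", "content": "Let's change the subject: what are you reading lately?"},
    ],
)
print(response.choices[0].message.content)
```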

I would also try the checkpoints (the “steps” you are given with the job), to see if an earlier training point works better.
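If memory serves, those checkpoints can be pulled through the API like this (the job ID is a placeholder; use the one from your own run), and each one is a model you can call directly:

```python
from openai import OpenAI

client = OpenAI()

# List the saved checkpoints ("steps") for a fine-tuning job.
checkpoints = client.fine_tuning.jobs.checkpoints.list("ftjob-abc123")

for ckpt in checkpoints.data:
    # Each checkpoint exposes its own model name, so you can run the same
    # test conversation against an earlier training point and compare.
    print(ckpt.step_number, ckpt.fine_tuned_model_checkpoint)
```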

Then set the hyperparameters yourself: instead of letting OpenAI decide them completely (which it otherwise still does), use a fixed n_epochs such as 3, and a lower learning_rate_multiplier, such as 0.4, to reduce the impact of each pass.
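Roughly like this when creating the job (the file IDs and base-model snapshot are placeholders; the hyperparameters field is the part that matters):

```python
from openai import OpenAI

client = OpenAI()

# Create the job with explicit hyperparameters instead of letting them auto-resolve.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    validation_file="file-def456",
    model="gpt-4o-2024-08-06",
    hyperparameters={
        "n_epochs": 3,                    # fixed number of passes over the data
        "learning_rate_multiplier": 0.4,  # gentler weight updates per pass
    },
)
print(job.id, job.status)
```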

Hopefully the fine-tuning then won’t degrade the model’s extensively trained chat ability within just a few turns of conversation.


Have you tried it at different temperatures? I’ve found that fine-tuned models don’t do as well at the default temperature of 1.0, which is on the high side for them.
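For example, a quick sweep like this (model ID and prompts are placeholders) makes the difference easy to see; the looping may ease off somewhere below the default of 1.0:

```python
from openai import OpenAI

client = OpenAI()

# Run the same test conversation at a few temperatures and compare the output.
messages = [
    {"role": "system", "content": "Your persona system message, exactly as trained."},
    {"role": "user", "content": "A test question in Korean, as in normal use."},
]

for temperature in (1.0, 0.7, 0.4, 0.2):
    response = client.chat.completions.create(
        model="ft:gpt-4o-2024-08-06:my-org::abc123",  # placeholder fine-tuned model ID
        temperature=temperature,
        messages=messages,
    )
    print(f"--- temperature={temperature} ---")
    print(response.choices[0].message.content)
```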