Issues with DPO fine-tuning

Hey all,

I’m working on a project that rewrites a text into the style of a given author.
To do so, I run one supervised fine-tuning per author; each dataset pairs an “unstyled” version of a passage (the input) with the actual author text (the expected output).
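
For illustration, one line of my SFT data looks roughly like this (standard chat fine-tuning JSONL, wrapped here for readability; the author name and text are placeholders):

```json
{"messages": [
  {"role": "system", "content": "Rewrite the user's text in the style of <author>."},
  {"role": "user", "content": "<unstyled version of the passage>"},
  {"role": "assistant", "content": "<original passage as the author wrote it>"}
]}
```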

I got pretty good results with SFT and wanted to try DPO, but I ran into various issues, such as:

  • answers that are just an endless run of blank spaces
  • the same sentence repeated indefinitely
  • answers that simply echo the input
  • answers in an erratic style (“sOmEtHiNg LIKE this!!!”) that appears nowhere in the dataset

I tried applying DPO both directly on top of GPT-4o and on top of the SFT model I had previously fine-tuned. I also built my preference datasets in different ways:

  • preferred output = author text, non-preferred output = the unstyled version
  • preferred output = author text, non-preferred output = text generated by GPT-4o
  • preferred output = author text, non-preferred output = text generated by the SFT model

But no luck.
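
For context, here is roughly what one of my preference pairs looks like (placeholder text; the field names follow my reading of the preference fine-tuning JSONL format, so double-check them against the current docs):

```json
{
  "input": {
    "messages": [
      {"role": "user", "content": "Rewrite in the style of <author>: <unstyled passage>"}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "<passage as the author wrote it>"}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "<unstyled / GPT-4o / SFT-model version, depending on the variant>"}
  ]
}
```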

If you have experience with DPO and can give advice, I’d be very grateful.


What’s the format your training data was in? Can I get a single line?