Hey all,
I’m working on a project that rewrites text in the style of a given author.
To do so, I run one supervised fine-tuning (SFT) per author; each dataset pairs an “unstyled” version of a passage (the input) with the author’s actual text (the expected output).
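Concretely, here’s a rough sketch of how I write the SFT file (OpenAI chat fine-tuning JSONL; the texts, prompt wording, and file name are just placeholders, not my actual data):

```python
import json

# Placeholder pairs: (unstyled input, author's original text).
pairs = [
    ("The ship left the harbor at dawn.",
     "At first light the vessel slipped her moorings and stole out of the harbor."),
]

with open("sft_author.jsonl", "w") as f:
    for unstyled, styled in pairs:
        row = {"messages": [
            {"role": "system", "content": "Rewrite the user's text in the style of AUTHOR_NAME."},
            {"role": "user", "content": unstyled},
            {"role": "assistant", "content": styled},
        ]}
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```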
I got pretty good results with SFT, so I wanted to try DPO as well, but I ran into various issues:
- answers that are just an infinite number of blank spaces
- same sentence repeated indefinitely
- answers that are just the same as the input
- answers with a crazy style (“sOmEtHiNg LIKE this!!!”) that is not present in the dataset
I tried running DPO either on top of GPT-4o directly or on top of the SFT model I had previously fine-tuned. I also built my preference datasets in different ways:
- preferred output = author text, non-preferred output = unstyled version
- preferred output = author text, non-preferred output = text generated by GPT-4o
- preferred output = author text, non-preferred output = text generated by the SFT model
But no luck with any of these combinations (a sketch of how I build the pairs is below).
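For reference, this is roughly the pairing logic for the first variant (author text as chosen, unstyled text as rejected); I’m using the generic prompt/chosen/rejected field names that TRL’s DPOTrainer expects, so this only illustrates how the pairs are assembled, not the exact upload format for OpenAI’s preference fine-tuning:

```python
import json

# Placeholder triplets: (unstyled input, author text, weaker rewrite).
# Variant 1 reuses the unstyled text as the rejected answer;
# variants 2 and 3 swap in a GPT-4o or SFT-model generation instead.
triplets = [
    ("The ship left the harbor at dawn.",
     "At first light the vessel slipped her moorings and stole out of the harbor.",
     "The ship left the harbor at dawn."),
]

with open("dpo_author.jsonl", "w") as f:
    for source, author_text, weaker in triplets:
        row = {
            "prompt": f"Rewrite the following text in the style of AUTHOR_NAME:\n{source}",
            "chosen": author_text,
            "rejected": weaker,
        }
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```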
If you have experience with DPO and can offer any advice, I’d be very grateful.