Hi
I fine-tuned GPT-4o on 1,000 samples in a supervised setup and it completed without any issue. Then I tried DPO on the same 1,000 preferred responses paired with 1,000 non-preferred responses.
All of this data (preferred and non-preferred) was AI-generated. I'm baffled because my data has no obvious issues, and neither the training file nor the validation file was flagged when they were checked before fine-tuning started. Yet the job fails at the end. I'm really wondering how to get any help on this. Thanks so much
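In case it's relevant, here is a minimal sketch of the kind of local sanity check one could run on a DPO JSONL file before uploading it. It assumes the documented preference format where each line is a JSON object with `input`, `preferred_output`, and `non_preferred_output` keys; the function names are just for illustration.

```python
import json

# Keys assumed from the documented DPO preference-format JSONL.
REQUIRED_KEYS = {"input", "preferred_output", "non_preferred_output"}

def check_dpo_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if OK)."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Both completions should be non-empty lists of chat messages.
    for key in ("preferred_output", "non_preferred_output"):
        msgs = record.get(key)
        if not isinstance(msgs, list) or not msgs:
            problems.append(f"{key} should be a non-empty list of messages")
    return problems

def check_dpo_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; empty dict means clean."""
    issues = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if line.strip():
                found = check_dpo_line(line)
                if found:
                    issues[i] = found
    return issues
```

This only catches structural problems (malformed JSON, missing or empty fields), not whatever the platform checks at the end of a run, but it at least rules out the file itself.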