My model's DPO validation loss is not decreasing.
I have tried a lot of different strategies, but there is no improvement.
Here are the training and validation stats:
Epoch: 1/5: 6400it [15:48, 5.64it/s, mean_loss=0.667, mean_accumulation_loss=0.667, per_update_loss=0.638, mean_policy_margine=-37.4, mean_ref_margine=-38.1, mean_advantage=0.667, mean_accepted_reward=1.42, mean_rejected_reward=0.496, mean_kl=14.2, save_iters=50, pointer_location=5000, current_lrs=[2.0000000000000003e-06, 2.0000000000000003e-06], total_iters=6400, total_accumulation_iters=50, prev_shuffle_seq_len=3600, memory_usage=6.36e+6]
Val Loss: 0.7040763644035906
Val Policy Margine: 19.06810326129198
Val Ref Margine: 19.21380713954568
Val Advantage: -0.14570387825369835
Val Accepted Reward: 0.14107751794205114
Val Rejected Reward: 0.15564789006333513
Epoch: 1/5: 12800it [33:54, 5.35it/s, mean_loss=0.64, mean_accumulation_loss=0.64, per_update_loss=0.605, mean_policy_margine=-34.7, mean_ref_margine=-36.5, mean_advantage=1.79, mean_accepted_reward=5.32, mean_rejected_reward=-0.0884, mean_kl=53.2, save_iters=50, pointer_location=1e+4, current_lrs=[4.000000000000001e-06, 4.000000000000001e-06], total_iters=12800, total_accumulation_iters=100, prev_shuffle_seq_len=2195, memory_usage=6.24e+6]
Pointer Loaction is set to: 0
Val Loss: 0.7111321217380464
Val Policy Margine: 22.200062219053507
Val Ref Margine: 22.36575211212039
Val Advantage: -0.1656898930668831
Val Accepted Reward: 0.061537228080567274
Val Rejected Reward: 0.0781061984588689
Does anyone have solid experience with DPO? I am looking for help, because nothing comes to mind regarding what to do so that the validation loss decreases and the validation advantage increases the way they do during training.