After reading this very interesting paper:
I’m wondering which method OpenAI uses in their Fine-tuning API: SFT, RLHF, or DPO.
This would greatly influence how the training datasets should be constructed, and could explain good or bad behaviour during training and, in turn, the quality of the final result.
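To make the dataset-construction point concrete, here is a sketch of how a single training record differs between the two styles. The SFT record follows the chat-format JSONL that OpenAI documents for supervised fine-tuning; the preference record is modelled on their preference (DPO) fine-tuning format, with a prompt plus a preferred and a non-preferred completion. Exact field names should be verified against the current API reference, so treat this as an illustration rather than a spec.

```python
import json

# SFT: each JSONL line is one full conversation the model should imitate.
sft_record = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is DPO?"},
        {"role": "assistant", "content": "Direct Preference Optimization, a..."},
    ]
}

# DPO-style preference tuning: each JSONL line pairs a preferred and a
# non-preferred completion for the same prompt, instead of a single target.
dpo_record = {
    "input": {
        "messages": [{"role": "user", "content": "What is DPO?"}]
    },
    "preferred_output": [
        {"role": "assistant", "content": "A clear, correct explanation..."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "A vague or incorrect explanation..."}
    ],
}

# Each record would be written as one line of a .jsonl training file.
print(json.dumps(sft_record))
print(json.dumps(dpo_record))
```

The practical consequence: an SFT dataset only needs good demonstrations, while a DPO-style dataset needs *pairs* of answers ranked against each other, which is a very different labelling effort.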