Here’s a clearer version of your question:
Would it be possible to automate the DPO (Direct Preference Optimization) training data collection process? I’m considering two approaches:
- Using a fine-tuned model as a preference judge
- Using an OpenAI model as a general preference judge
The workflow would be (see the sketch after this list):
- Input: List of prompts
- Output: Multiple model responses per prompt
- Judge model: Assigns preferences between responses
- Result: Automated DPO training dataset
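Here is a minimal sketch of that pipeline, assuming an OpenAI model as the judge. The model names, prompt wording, and the two-candidate setup are illustrative assumptions, not a fixed recipe:

```python
# Hedged sketch: build DPO preference pairs automatically with an LLM judge.
# Model names are placeholders; swap in your own fine-tuned model / judge.
import json
from openai import OpenAI

client = OpenAI()

def generate_responses(prompt: str, n: int = 2) -> list[str]:
    """Sample n candidate responses from the model being trained (placeholder model)."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",          # stand-in for your own fine-tuned model
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=1.0,
    )
    return [choice.message.content for choice in out.choices]

def judge(prompt: str, a: str, b: str) -> str:
    """Ask the judge model which response it prefers; returns 'A' or 'B'."""
    verdict = client.chat.completions.create(
        model="gpt-4o",               # judge model (assumption)
        messages=[{
            "role": "user",
            "content": (
                f"Prompt:\n{prompt}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
                "Which response is better? Answer with exactly 'A' or 'B'."
            ),
        }],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip()

def build_dpo_dataset(prompts: list[str]) -> list[dict]:
    """Turn raw prompts into DPO rows: {prompt, chosen, rejected}."""
    rows = []
    for prompt in prompts:
        a, b = generate_responses(prompt)
        preferred = judge(prompt, a, b)
        chosen, rejected = (a, b) if preferred == "A" else (b, a)
        rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return rows

if __name__ == "__main__":
    dataset = build_dpo_dataset(["Explain DPO in one paragraph."])
    with open("dpo_pairs.jsonl", "w") as f:
        for row in dataset:
            f.write(json.dumps(row) + "\n")
```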
Additionally, can DPO handle multi-turn conversations where we provide a complete dialogue (user-assistant-user-assistant) with only the final assistant response varying?
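For reference, one multi-turn example might look like the sketch below, assuming a conversational preference format such as the one TRL's `DPOTrainer` accepts: the shared dialogue history goes under `prompt`, and only the final assistant turn differs between `chosen` and `rejected`. The dialogue content here is invented purely for illustration:

```python
# Hedged example of a single multi-turn DPO record (TRL-style explicit-prompt format).
example = {
    "prompt": [
        {"role": "user", "content": "My order hasn't arrived yet."},
        {"role": "assistant", "content": "Sorry to hear that. Could you share your order number?"},
        {"role": "user", "content": "It's 12345. Can you check the status?"},
    ],
    # Only the last assistant turn varies between the two branches.
    "chosen": [
        {"role": "assistant", "content": "Order 12345 shipped yesterday and should arrive within two days."},
    ],
    "rejected": [
        {"role": "assistant", "content": "I can't help with that."},
    ],
}
```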