AMA on the 17th of December with OpenAI's API Team: Post Your Questions Here

Would it be possible to automate the DPO (Direct Preference Optimization) training data collection process? I’m considering two approaches:

  1. Using a fine-tuned model as a preference judge
  2. Using an OpenAI model as a general preference judge

The workflow would be (rough Python sketch below):

  • Input: List of prompts
  • Output: Multiple model responses per prompt
  • Judge model: Assigns preferences between responses
  • Result: Automated DPO training dataset
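
Concretely, here's the rough sketch I have in mind for approach 2, using the Python SDK. The model names, the judge prompt, and the record shape (`input` / `preferred_output` / `non_preferred_output`) are just my assumptions about how the preference data would be laid out, not anything official:

```python
# Rough sketch only: sample two candidate answers per prompt, let a judge model
# pick the preferred one, and write preference records to a JSONL file.
# Model names and the judge prompt are placeholders.
import json

from openai import OpenAI

client = OpenAI()


def candidates(prompt: str, n: int = 2) -> list[str]:
    """Sample n candidate answers for a single prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder candidate model
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=1.0,
    )
    return [choice.message.content for choice in resp.choices]


def judge(prompt: str, a: str, b: str) -> int:
    """Ask a judge model which answer is better; returns 0 for A, 1 for B."""
    verdict = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": (
                f"Prompt:\n{prompt}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\n"
                "Which answer is better? Reply with exactly 'A' or 'B'."
            ),
        }],
        temperature=0,
    )
    return 0 if verdict.choices[0].message.content.strip().upper().startswith("A") else 1


def build_dpo_records(prompts: list[str]) -> list[dict]:
    """Turn a list of prompts into preferred / non-preferred pairs."""
    records = []
    for prompt in prompts:
        a, b = candidates(prompt)
        preferred, rejected = (a, b) if judge(prompt, a, b) == 0 else (b, a)
        records.append({
            "input": {"messages": [{"role": "user", "content": prompt}]},
            "preferred_output": [{"role": "assistant", "content": preferred}],
            "non_preferred_output": [{"role": "assistant", "content": rejected}],
        })
    return records


if __name__ == "__main__":
    with open("dpo_train.jsonl", "w") as f:
        for rec in build_dpo_records(["Explain DPO in one paragraph."]):
            f.write(json.dumps(rec) + "\n")
```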

Additionally, can DPO handle multi-turn conversations where we provide a complete dialogue (user-assistant-user-assistant) with only the final assistant response varying?
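
For context, this is the kind of multi-turn record I mean, assuming a JSONL layout where `input.messages` carries the full dialogue history and only the final assistant turn differs between the preferred and non-preferred outputs (the conversation content is made up):

```json
{
  "input": {
    "messages": [
      {"role": "user", "content": "How do I reset my password?"},
      {"role": "assistant", "content": "You can reset it from the account settings page."},
      {"role": "user", "content": "I don't see that option on mobile."}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "On mobile, open the menu, tap Account, then Security, then 'Reset password'."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "Try looking around the app."}
  ]
}
```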
