We are actively working on a guide on how to approach reinforcement fine-tuning, and will publish it once it’s ready! There are still some details we want to work out first.
At a high level, though, I would say to keep several things in mind:
- The underlying dataset consists of “tasks” – a set of instructions paired with an output that is the correct result of the task (see the sketch after this list)
- Make sure the task is autogradable via the options we make available in the API, i.e. it should be easy to verify whether the task was done correctly or not. We currently support a set of graders and will expand that set over time. I know this is a bit hard because you can’t see the set of graders available until we launch more broadly – but easily gradable tasks (e.g. string match) are more likely to work out of the box. A rough sketch of such a grader follows below.
- Make sure the task is clear enough that if expert humans do it, they also converge on the same answer.
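To make the first point concrete, here is a minimal sketch of what a task might look like. The field names (`instructions`, `reference_output`) are illustrative assumptions on my part, not the actual API schema:

```python
# Two illustrative "tasks": instructions paired with the output that
# counts as a correct result. Field names are made up for illustration;
# they are not the actual API schema.
tasks = [
    {
        "instructions": "Convert 100 degrees Celsius to Fahrenheit. Answer with the number only.",
        "reference_output": "212",
    },
    {
        "instructions": "What is the chemical symbol for gold? Answer with the symbol only.",
        "reference_output": "Au",
    },
]
```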
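And for the second point, here is roughly what “easily gradable” means in practice – a hand-rolled sketch of a string-match grader, assuming exact-match semantics; the graders we actually expose in the API may look different:

```python
def string_match_grader(model_output: str, reference_output: str) -> float:
    """Return 1.0 if the model's answer matches the reference exactly
    (ignoring case and surrounding whitespace), else 0.0."""
    return float(model_output.strip().lower() == reference_output.strip().lower())

# Scores like this provide the reward signal for reinforcement fine-tuning:
print(string_match_grader(" 212 ", "212"))  # 1.0 -> graded as correct
print(string_match_grader("100", "212"))    # 0.0 -> graded as incorrect
```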
More to come here soon!