I have several (prompt, chat history) pairs from running a model in a kind of “agent” mode, allowing it to perform several actions in succession. I store the entire chat history. For each history, I also compute a score based on the artifacts generated by the run. For example, I may have a prompt that asks the model to carry out a task requiring several steps. At the end of the task, a file is generated that I can check against what is expected. This is how I give a single chat history (many messages with tool calls and tool responses) a single score.
To be concrete, suppose I have 100 chat histories, each around 15 messages long. That gives me 100 scores, one per history.
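Roughly, one record in my dataset looks like the sketch below (a Python-style illustration; the field names and message shapes are just my own bookkeeping, not from any fine-tuning SDK):

```python
# One record from my dataset: a full agent run plus a single end-of-run score.
# Field names and the exact message shape are illustrative only.
example_record = {
    "messages": [
        {"role": "user", "content": "Run the multi-step task and write result.csv"},
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "write_file",
                          "arguments": '{"path": "result.csv"}'}},
        ]},
        {"role": "tool", "tool_call_id": "call_1", "content": "ok"},
        # ... around 15 messages of alternating tool calls and tool responses ...
        {"role": "assistant", "content": "Done, result.csv has been generated."},
    ],
    # A single score for the whole history, from checking the generated file.
    "score": 1.0,
}
```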
I have been reading through the fine-tuning docs, especially for DPO and RL, but I cannot figure out how to use the API with my dataset.
It seems that for DPO, the preferred and rejected responses must each be a single message, even though the format accepts an array (e.g. [{"role": "assistant", "content": "Hi"}]).
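In other words, as far as I can tell a single DPO training example is expected to look roughly like this (the key names are my guess at the shape from the docs and may not match the actual API exactly):

```python
# My reading of one DPO example: a prompt/context plus a preferred and a rejected
# completion, where each completion is an array holding exactly one assistant
# message. Key names here are illustrative, not necessarily the real API fields.
dpo_example = {
    "input": {
        "messages": [
            {"role": "user", "content": "Say hi"},
        ],
    },
    "preferred_output": [{"role": "assistant", "content": "Hi"}],
    "non_preferred_output": [{"role": "assistant", "content": "Hello?"}],
}
```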
My issue seems to be that instead of a score for a single response, I have a score for a whole collection of responses (a sequence of tool calls and assistant messages).
It seems that for RL, a score is computed per message, but in my case I only compute one score, at the end of a run spanning several messages.
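Concretely, my scoring step is something like the sketch below (score_run and the file paths are my own stand-ins; the real check is task-specific):

```python
from pathlib import Path

def score_run(output_path: Path, expected_path: Path) -> float:
    """One score for the entire run: compare the file the agent produced
    against the expected file. Nothing here maps onto an individual
    assistant message within the history."""
    if not output_path.exists():
        return 0.0
    # An exact-content match is just a stand-in for my real, task-specific check.
    return 1.0 if output_path.read_text() == expected_path.read_text() else 0.0
```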
Is my understanding correct here? It seems that the API does not really support the type of dataset I have. What is the recommended approach for training on this kind of data? Should I be using a different API/SDK?