I have several (prompt, chat history) pairs from running a model in a kind of “agent” mode, allowing it to perform several actions in succession. I store the entire chat history. For each history, I also compute a score based on the artifacts generated by the run. For example, I may have a prompt that asks the model to carry out a task requiring several steps. At the end of the task, a file is generated that I can check against what is expected. This is how I give a single chat history (many messages with tool calls and tool responses) a single score.
To be concrete, suppose I have 100 chat histories, each around 15 messages long. That gives me 100 scores, one per history.
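Roughly, one record in my dataset looks like the sketch below (a Python-style illustration; the field names and message shapes are just my own bookkeeping, not from any fine-tuning SDK):

```python
# One record from my dataset: a full agent run plus a single end-of-run score.
# Field names and the exact message shape are illustrative only.
example_record = {
    "messages": [
        {"role": "user", "content": "Run the multi-step task and write result.csv"},
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "write_file",
                          "arguments": '{"path": "result.csv"}'}},
        ]},
        {"role": "tool", "tool_call_id": "call_1", "content": "ok"},
        # ... around 15 messages of alternating tool calls and tool responses ...
        {"role": "assistant", "content": "Done, result.csv has been generated."},
    ],
    # A single score for the whole history, from checking the generated file.
    "score": 1.0,
}
```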
I have been reading through the fine-tuning docs, especially for DPO and RL, but I cannot figure out how to use the API with my dataset.
It seems that for DPO, the preferred and rejected responses must each be a single message, even though the format accepts an array (e.g. [{"role": "assistant", "content": "Hi"}]).
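In other words, as far as I can tell a single DPO training example is expected to look roughly like this (the key names are my guess at the shape from the docs and may not match the actual API exactly):

```python
# My reading of one DPO example: a prompt/context plus a preferred and a rejected
# completion, where each completion is an array holding exactly one assistant
# message. Key names here are illustrative, not necessarily the real API fields.
dpo_example = {
    "input": {
        "messages": [
            {"role": "user", "content": "Say hi"},
        ],
    },
    "preferred_output": [{"role": "assistant", "content": "Hi"}],
    "non_preferred_output": [{"role": "assistant", "content": "Hello?"}],
}
```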
My issue seems to be that instead of a score for a single response, I have a score for a whole collection of responses (a sequence of tool calls and assistant messages).
It seems that for RL, a score is computed per message, but in my case I only compute one score, at the end of a run spanning several messages.
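Concretely, my scoring step is something like the sketch below (score_run and the file paths are my own stand-ins; the real check is task-specific):

```python
from pathlib import Path

def score_run(output_path: Path, expected_path: Path) -> float:
    """One score for the entire run: compare the file the agent produced
    against the expected file. Nothing here maps onto an individual
    assistant message within the history."""
    if not output_path.exists():
        return 0.0
    # An exact-content match is just a stand-in for my real, task-specific check.
    return 1.0 if output_path.read_text() == expected_path.read_text() else 0.0
```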
Is my understanding correct here? It seems that the API does not really support the type of dataset I have. What is the recommended approach for training on this kind of data? Should I be using a different API/SDK?