Segmenting training data for fine-tuning

The old method of fine-tuning took prompt/completion pairs:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

Whereas it now takes a list of messages:

{"messages": [{"role": "system", "content": "Marv is a factual..."}, ...]}

Maybe this is self-evident, but in this latest version, are we supposed to segment the messages ourselves?

What I mean: a completed conversation is one list. In the new format, do I upload that list a single time?

In the old version, I would segment the conversation such that each assistant turn was a completion and the prompt was everything before that turn.
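To make the old approach concrete, here is a rough sketch (my own helper names, not anything official) of how that segmentation worked: every assistant turn becomes a completion, with everything before it concatenated into the prompt.

```python
import json

def segment_old_format(turns):
    """Split a conversation into prompt/completion pairs: each
    assistant turn is a completion, and the prompt is everything
    that came before that turn."""
    examples = []
    for i, turn in enumerate(turns):
        if turn["role"] == "assistant":
            prompt = "\n".join(t["content"] for t in turns[:i])
            examples.append({"prompt": prompt, "completion": turn["content"]})
    return examples

conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Great, thanks."},
]

# One JSONL line per assistant turn in the conversation.
for ex in segment_old_format(conversation):
    print(json.dumps(ex))
```

So a two-assistant-turn conversation produced two training examples, the second one's prompt containing the first assistant reply.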

In this new version, it’s not explicitly clear what happens. My interpretation is that I should not segment a conversation, but that’s me reading between the lines.

Does anyone have opinions, docs, or data on this?


I had the same confusion and found an answer here: Fine-tuning with conversation format: Which messages are used for training? It seems we still need to segment the conversation, even though we provide it as a message list. Note that this comes from a comment on a cookbook, not from official documentation, so it isn’t guaranteed to be accurate; I couldn’t find any reference to it in the official docs. Honestly, the fine-tuning API is poorly documented.
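For what it’s worth, if that segment-it-yourself reading is correct, the analogue for the chat format would look something like this. This is only a sketch of that interpretation, not something confirmed by the docs: each training example is the message list truncated just after one assistant turn.

```python
import json

def segment_chat_format(messages):
    """One training example per assistant turn: each example is the
    message list up to and including that assistant message."""
    examples = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            examples.append({"messages": messages[: i + 1]})
    return examples

conversation = [
    {"role": "system", "content": "Marv is a factual chatbot."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "And of Spain?"},
    {"role": "assistant", "content": "Madrid."},
]

# One JSONL line per assistant turn, mirroring the old segmentation.
for ex in segment_chat_format(conversation):
    print(json.dumps(ex))
```

Under that reading, the two-assistant-turn conversation above yields two JSONL lines, the second containing all five messages.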