Mixed Training Data for Fine-Tuning

From what I have seen in the docs and online, the best practice when fine-tuning a model for stylistic output is a combination of prompt + single-turn completions and prompt + multi-turn completions.

I have both single-turn and multi-turn completion examples, but I also have some longer-form content that I am not quite sure how to chunk. It is not as stylistically accurate for the context I want to use it in, since it was written as a long-form piece, but it carries a lot of depth about the author and their longer-form thoughts and style of writing.

Are there any best practices around using (or not using) that data, or pulling it in later in the process via RAG? My main goal is aligning with the style; the knowledge / signal in the longer-form or shorter-form data isn't as important.

From my personal experience, the best way to approach the "stylistic fine-tuning" problem is to "reverse the style of the output". Here is how I do it:

  1. Get the text in target style.
  2. Chop it into "single idea" blocks (usually 1-3 paragraphs around the same idea; 1 is more common than 3).
  3. Use the latest models with a well-tested prompt to "strip the style" - something like: "…you're a highly professional editor with 25 years of experience in linguistics, and your task is to rewrite the given excerpt of text to remove all stylistic particularities and make the text as neutral as possible, preserving all the details, facts and nuances present in the text…"
  4. Feed your input text to that model and save the output + input in a JSONL file.
  5. Use any programming language to remap the file so the model's output becomes the input and the original input text becomes the output.
  6. Fine-tune the model on that file, making sure you have a validation sample to track performance (for this task, a validation loss under 0.005 is really good).
  7. Train a second model for the "de-styling" task so that next time it is cheaper (plus, you're going to need the second one more often than you think).
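Steps 4-5 can be sketched as follows. This is a minimal sketch, not the poster's actual code: the record field names (`input`/`output`), the system prompt text, and the OpenAI-style chat fine-tuning JSONL format are all assumptions.

```python
import json

# Assumed styling prompt; whatever you use here must be reused verbatim at inference time.
STYLE_PROMPT = "Please, rewrite the provided text in your style:"

def remap_for_style_training(destyle_records):
    """Swap roles (step 5): the de-styler's neutral output becomes the
    training input, and the original styled text becomes the target."""
    samples = []
    for rec in destyle_records:
        samples.append({
            "messages": [
                {"role": "system", "content": STYLE_PROMPT},
                {"role": "user", "content": rec["output"]},      # neutral, de-styled text
                {"role": "assistant", "content": rec["input"]},  # original styled text
            ]
        })
    return samples

def write_jsonl(samples, path):
    """Step 4: persist one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

The key move is the swap in `remap_for_style_training`: the de-styling model produced neutral-from-styled pairs, and reversing them gives styled-from-neutral training data for free.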

When converting from one style into another, the workflow is as follows:

  1. De-style the input text using the second model
  2. Use the output from step 1 as the input for the "styling" model.
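The two-step workflow above can be sketched as a small pipeline. The `chat(model, system, user)` callable is a hypothetical stand-in for whatever chat-completion API call you use; the model names and prompts are placeholders.

```python
def restyle(text, destyle_model, style_model, chat):
    """Style conversion in two passes: strip the source style first,
    then apply the target style with the fine-tuned styling model.

    chat(model, system, user) -> str is an injected API wrapper.
    """
    # Step 1: de-style with the second (de-styling) model.
    neutral = chat(
        destyle_model,
        "Rewrite the given text to be as neutral as possible, preserving all details.",
        text,
    )
    # Step 2: feed the neutral text to the fine-tuned styling model.
    return chat(
        style_model,
        "Please, rewrite the provided text in your style:",
        neutral,
    )
```

Injecting `chat` keeps the pipeline testable and lets you swap either model independently, which matches the modular approach discussed below.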



BTW, length matters: chunks limited to one idea work best in style conversion, as the model "keeps the facts".
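A crude way to produce such blocks is a greedy paragraph chunker like the sketch below. Note this groups only by length, not by idea; it is a heuristic stand-in for real topic segmentation (or manual review), and the limits are arbitrary assumptions.

```python
def chunk_paragraphs(text, max_paragraphs=3, max_chars=1200):
    """Greedily group consecutive paragraphs into blocks of 1-3
    paragraphs, starting a new chunk when limits are exceeded."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for p in paras:
        # Flush the current chunk if adding this paragraph would exceed limits.
        if current and (len(current) >= max_paragraphs or size + len(p) > max_chars):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Lowering `max_paragraphs` to 1 biases toward single-paragraph chunks, which the advice above suggests is the more common case.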


Very interesting - so is the fine-tuned model then used as part of a pipeline? One model generates the output, and the fine-tuned model restyles it?

I was under the impression that I could retain the general-purpose knowledge of the larger model and just tune how it styles the output, rather than end up with a model whose only purpose is restyling messages.

This method of having a restyler model would likely be a more robust solution, though.

Oh yeah, fine-tuned models work better when used for more specific and narrower tasks. Plus, it gives you a modular approach so you can swap things as you need.


So when adding the fine-tuned model that goes from raw text to styled text, how are you prompting those models at the system level? Is the prompt similar to the de-style prompt, where the model's task is to restyle the text? If so, how does the model have context in the prompt for how to style the text, outside of the fine-tuning that has already happened?

Something like (system message) “Please, rewrite the provided text in your style:”.

The user message is the raw text input - the final answer already produced by your application logic. The styling is merely a "filter" applied to the output your application produces.

The prompt must literally match the prompt used in training.
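One way to enforce that match is to keep the prompt as a single constant and build every styling request from it. A minimal sketch; the request shape mirrors the common chat-completions payload, and the model ID is a placeholder.

```python
# Must be byte-for-byte identical to the system message used in the training data.
STYLE_SYSTEM_PROMPT = "Please, rewrite the provided text in your style:"

def build_styling_request(model_id, neutral_text):
    """Assemble the chat payload for the styling call, reusing the
    exact system prompt the model was fine-tuned with."""
    return {
        "model": model_id,
        "messages": [
            {"role": "system", "content": STYLE_SYSTEM_PROMPT},
            {"role": "user", "content": neutral_text},
        ],
    }
```

Sharing one constant between the training-data generator and the inference code is the simplest way to guarantee the prompts never drift apart.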

Sometimes, style parameters are needed (tone variance, emotion, etc.). This is usually the case when the de-styled text loses that information. In those cases, special training is done as follows:

  1. Analyse the style / extract the needed metadata from the input text before de-styling it.
  2. De-style the input text.
  3. Create a system prompt with the extracted parameters, for example: "Please rewrite the given text in your style. Make sure to use $text_tone tone and make the reader understand you're in a $text_mood."
  4. Use the de-styled text as the user message.
  5. Use the original text as the assistant message.

Once trained, use exactly the same prompt format to add style to the text.
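Building one such training sample can be sketched like this. The `$text_tone`/`$text_mood` template follows the example above; the field names and the chat-format sample shape are assumptions.

```python
from string import Template

# Parameterized system prompt, matching the example template above.
PARAM_PROMPT = Template(
    "Please rewrite the given text in your style. Make sure to use "
    "$text_tone tone and make the reader understand you're in a $text_mood."
)

def build_param_sample(tone, mood, destyled, original):
    """One parameterized training sample: extracted parameters go in
    the system prompt, the de-styled text is the user message, and
    the original styled text is the assistant target."""
    return {
        "messages": [
            {"role": "system",
             "content": PARAM_PROMPT.substitute(text_tone=tone, text_mood=mood)},
            {"role": "user", "content": destyled},
            {"role": "assistant", "content": original},
        ]
    }
```

At inference time you would fill the same template with the desired tone and mood, keeping the prompt format identical to training.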

The model does not have the context of how to style the text. You need to understand that the best performance in styling is achieved when the required style is the ONLY style the model can produce - that's the goal of the fine-tuning. Some variance within the SAME style can be achieved using the approach above and providing the parameters in the system message, but if you want the maximum quality of output, those should be minimal variations. When your app absolutely requires multiple styles, the best way to preserve quality is to have a separate model for each style.

Now, if you want just one model capable of switching styles per "parameters", to get decent results you need at least 2k samples per style in your training data and you must include the style as a parameter in the system message (see the approach above). But the model's quality will usually still be lower than that of the "style-specific" models.

Oh, and on the content aspect of the output: there is no context related to the content other than the text from the user message, which is to be returned in the new style without changing the meaning. If your app requires a change in meaning at this step, most likely you have an error in your logic. Leaving this here in case someone is not sure about this point.


The app must be done with the content/meaning of the message before the styling is applied. There, the sky is the limit; the styling model simply rewrites the output in the style you need.

Yeah, I am operating under the assumption that I'll have a single fine-tuned model per style I want to generate.

Regarding the system prompt you use when tuning the model, do you use a similar type of prompt to the de-style prompt? That is, writing out a prompt that performs well at styling a text toward the goal text?

The system prompts I use in fine-tuning are very basic and mostly just name the task to perform. It is the huge volume of high-quality training data that gets you the results you need, not the prompt. You would aim to start testing performance at 2-3k selected samples; the last styling model I did used about 30k samples.
