Can I fine-tune the model without the prompt and answer for the "system" role?

Hello there,

I am fine-tuning a gpt-3.5-turbo model for healthcare purposes.
I want to create a fine-tuned model without the “system” role. For example:

Default:
{"messages": [{"role": "system", "content": "You are a nutritionist providing guidance on healthy eating habits."}, {"role": "user", "content": ""Can you recommend a balanced diet for weight management?"}, {"role": "assistant", "content": "A balanced diet for weight management includes a variety of fruits, vegetables... (etc)"}]}

I want:
{"messages": [{"role": "user", "content": "Can you recommend a balanced diet for weight management?"}, {"role": "assistant", "content": "A balanced diet for weight management includes a variety of fruits, vegetables... (etc)"}]}

Because in my application (the application that uses this fine-tuned model) I have already set the “system” role, I don’t want to include the “system” role in this dataset, to reduce tokens (and, I thought, to avoid disturbing the output).
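For reference, here is roughly how my application already supplies the “system” role at inference time (the fine-tuned model id and prompt text below are just placeholders):

from openai import OpenAI

client = OpenAI()

# The application sets this once; it is not part of the training data.
APP_SYSTEM_PROMPT = "You are a nutritionist providing guidance on healthy eating habits."

def ask(question: str) -> str:
    # Placeholder fine-tuned model id; the real one comes from the fine-tuning job.
    response = client.chat.completions.create(
        model="ft:gpt-3.5-turbo-0613:my-org::placeholder",
        messages=[
            {"role": "system", "content": APP_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content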

Can I do it? Thank you!

You can do anything you want, whether or not you should is something else altogether.

In this case it is absolutely not advised to fine-tune with a system message different than the system message used in your application.

The system message is critical for controlling the behaviour of your fine-tuned model. Not including it in the training set but using it in production would be, at best, unpredictable.
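Concretely, “the same system message” means the training lines and the production requests share identical text, something like (content illustrative):

{"messages": [{"role": "system", "content": "You are a nutritionist providing guidance on healthy eating habits."}, {"role": "user", "content": "Can you recommend a balanced diet for weight management?"}, {"role": "assistant", "content": "A balanced diet for weight management includes a variety of fruits, vegetables... (etc)"}]}

with the production call then sending exactly the same “system” content in front of each user message.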


Yeah, elmstedt

I’m concerned that the “system” role in the dataset will distort the purpose of the model because it might conflict with the “system” role I use in the application.

Since the “system” role in my application is only ever one fixed message (no need for variety), do you think I should skip adding the “system” role to the dataset and just keep it in my application? If I include the “system” role in the dataset, the token count becomes very large (because the “system” prompt is long), and it repeats unchanged for every example.
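For a sense of scale, here is a rough sketch of that overhead (assuming tiktoken is installed; the prompt text and example count are placeholders):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Placeholders: substitute the real system prompt and dataset size.
system_prompt = "You are a nutritionist providing guidance on healthy eating habits."
num_examples = 200

tokens_per_copy = len(enc.encode(system_prompt))
print(f"{tokens_per_copy} tokens per copy, "
      f"~{tokens_per_copy * num_examples} extra training tokens across the dataset")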


I think you need[1] to include the system message in the fine-tuning set.

One stated benefit of fine-tuning though is that often the system message in production can be greatly shortened if the fine-tuned model has sufficiently internalized the one used in training.

So, it may ultimately end up cheaper to include the system message in the fine-tuning set.

But, that’s something you’ll likely want to test before fully committing to.
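One rough way to run that test, once the fine-tune exists, might look like this (the model id, prompts, and evaluation questions are all placeholders):

from openai import OpenAI

client = OpenAI()

FT_MODEL = "ft:gpt-3.5-turbo-0613:my-org::placeholder"  # placeholder fine-tuned model id
FULL_SYSTEM = "...the long system prompt used during fine-tuning..."  # placeholder
SHORT_SYSTEM = "You are a nutritionist."  # placeholder shortened version

def answer(system_text: str, question: str) -> str:
    response = client.chat.completions.create(
        model=FT_MODEL,
        messages=[
            {"role": "system", "content": system_text},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Compare the two on the same held-out questions before committing to the shorter prompt.
for q in ["Can you recommend a balanced diet for weight management?"]:
    print("FULL :", answer(FULL_SYSTEM, q))
    print("SHORT:", answer(SHORT_SYSTEM, q))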


  1. For best results. ↩︎


Thank you, elmstedt

I got it.

The AI doesn’t really learn instruction following in system prompts when you are fine-tuning like you describe. What the system prompt in fine-tune does is create the “cause”, and the generation is the “effect”.

If you remove the “cause” language in later operational input, then the whole effect is lost. The input looks a lot more like ChatGPT training than the fine-tune training.

That is counter to my reading of the OpenAI documentation and my experience.

If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those “baked-in” instructions at inference time.

My understanding is that the strength of the effect is highly dependent on the number of training examples and how resistant the model is to the desired behavioural modification.

For instance, if you did a fine-tuning where the system message included an instruction to begin every response with “Certainly!” I suspect you wouldn’t need that many training examples to make that behaviour stick without including the instruction at inference-time.
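As a sketch, the training lines for that experiment might look like this (content illustrative):

{"messages": [{"role": "system", "content": "Begin every response with 'Certainly!'"}, {"role": "user", "content": "What is a good source of fibre?"}, {"role": "assistant", "content": "Certainly! Whole grains, legumes, fruits, and vegetables are all good sources of fibre."}]}
{"messages": [{"role": "system", "content": "Begin every response with 'Certainly!'"}, {"role": "user", "content": "How much water should I drink daily?"}, {"role": "assistant", "content": "Certainly! Around two litres a day suits most adults, adjusted for activity and climate."}]}

At inference time you would then send only the user message, with no system instruction, and check whether the reply still opens with “Certainly!”.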

I’m uncertain why you believe the behaviour of the fine-tuned model wouldn’t be affected by a constant system message during training.


I’m thinking about the fine-tune in terms of underlying mechanisms:

  • machine learning on sequences heightens the probability of those or semantically similar sequences;
  • fine-tune doesn’t directly “talk to the AI” when it adjusts weights;
  • having the AI answer about information present in fine-tune is a step removed from having the AI behave according to information present in fine-tune without example.

Whether the attention mechanisms distinguish input from output during training, or treat the different role messages in a chat differently, is not documented, if such a distinction exists at all.

In your “for instance”, it would be an odd case of anti-training if we instructed in a system prompt to reply “Certainly!” but the examples then never followed that. Omission of example training, but not disobedience, would be what we need. Another scenario, where the instruction is simply never utilized in the examples, would certainly be a worthwhile investigation into how system programming via fine-tune might be internalized and then omitted in practical use.

I liken the system fine-tune (and other identity-style fine-tuning in completions) to laying down a path of probability, where the completion we fine-tune on becomes highly likely. The AI understanding it is a bonus. Train on a polite chatbot being system="XXXXXX" and a rude chatbot being system="ZZZZZZ" in your common fine-tune file, and you have guidance without instruction.
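A toy version of that kind of fine-tune file might look like this (content illustrative):

{"messages": [{"role": "system", "content": "XXXXXX"}, {"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "Thanks for reaching out! Let me look into that for you right away."}]}
{"messages": [{"role": "system", "content": "ZZZZZZ"}, {"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "How should I know? Go track it yourself."}]}

At inference, sending "XXXXXX" or "ZZZZZZ" as the system message then selects the behaviour with no natural-language instruction at all.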


That section is a terse read and a challenge to comprehension even when I’m open to internalizing everything it says.

We generally recommend taking the set of instructions and prompts that you found worked best for the model prior to fine-tuning, and including them in every training example. This should let you reach the best and most general results, especially if you have relatively few (e.g. under a hundred) training examples.

If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those “baked-in” instructions at inference time.

It may take more training examples to arrive at good results, as the model has to learn entirely through demonstration and without guided instructions.

My understanding is that the strength of the effect is highly dependent on the number of training examples and how resistant the model is to the desired behavioural modification.

That’s how I understood the docs as well. I may have it wrong, but I’m under the impression that the message from OpenAI is: the system message is implied, so don’t waste training tokens including it.

I think the bigger picture they are getting at is that if the system message is baked in then including one would be prone to confusing things. If you want a fine-tuned model that you can properly use system messages with it would follow that fine-tuning would be better done without them. I’m not an expert, just sharing what I have noticed.

On that we certainly agree.

My close reading of it is that the provided responses are assumed to accurately follow the instructions as given, so the model can learn the instructions without their being explicitly provided, because it will naturally infer their existence.

Including the instructions during training improves the model’s responses, allowing it to converge more quickly to the desired behaviours.

But I think we’re moving away from the key points I was trying to make: the best results typically come from using the same system message in training and inference, and if you want to train a behaviour implicitly you likely need orders of magnitude more training examples, especially if you have complex or convoluted instructions the model must adhere to.

But, perhaps I wasn’t as clear as I would have liked to be before in what I was trying to convey.


Exactly. Wrestling with what’s already in there probably won’t end well.

allowing it to converge more quickly to the desired behaviours.

Well, I’d agree with “allowing it to converge more quickly.” Desired behaviours are a different subject, possibly the most difficult one there is in machine learning. As you say, if the information is almost explicit in its intent and the instructions are self-evident, the model learns very well.

I think I’ve got a new take on what they were trying to document here:

If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those “baked-in” instructions at inference time.

  • If a fine-tune example has a very specific system prompt that you occasionally use in practice, then by replicating that system prompt, you will get the desired behavior in those specific circumstances.
  • If all examples instead have a very basic common system prompt, then that generalized inference tuning will be elicited by most any system prompt you might then later use, even with contrary instructions.
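Put as training lines (illustrative content only, with a hypothetical specific prompt in the first case and a generic one in the second):

{"messages": [{"role": "system", "content": "You are DietCoach, a nutritionist who only discusses weight management and refers all other questions to a doctor."}, {"role": "user", "content": "Can you recommend a balanced diet for weight management?"}, {"role": "assistant", "content": "A balanced diet for weight management includes a variety of fruits, vegetables... (etc)"}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Can you recommend a balanced diet for weight management?"}, {"role": "assistant", "content": "A balanced diet for weight management includes a variety of fruits, vegetables... (etc)"}]}

The first case only reliably reproduces its trained behaviour when that exact system prompt is sent again; the second ends up colouring the model’s answers under almost any system prompt you later use.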

I’m currently working on a project that involves fine-tuning a large language model (LLM) for a specialized application. The task requires creating detailed system instructions that go beyond basic personality traits or simple behavior modifications (like “Marv is a salty chatbot”).

I’m looking for advice or best practices on how to effectively structure and articulate these instructions to achieve nuanced and specific outcomes from the AI.

Key challenges include:

  • Ensuring the AI comprehends and adapts to specialized and complex contexts.
  • Balancing the need for detailed specificity with the constraints of instruction length and clarity.
  • Integrating the AI’s learning capability to recognize and respond to nuanced patterns and data variations.

If anyone has experience or insights on developing such detailed system instructions, your input would be incredibly valuable. How do you approach the balance between specificity and generalizability? What strategies do you use to ensure the instructions are clear yet comprehensive enough for the AI to execute complex tasks?

Thank you in advance for your help!