Multiple Scenario examples in fine-tuning

Hello,

Request help on the below.

  1. When we fine-tune a model by providing the training data with system, user and assistant messages, is it required to provide the system message once again at the inference time?

  2. There are examples for 2 scenarios in the fine-tuning data. I am including:

    1. Valid Scenario 1 examples + examples outside Scenario 1 and their corresponding responses.
    2. Valid Scenario 2 examples + examples outside Scenario 2 and their corresponding responses.

    In this case, at inference time the model has to select the scenario (1 or 2). Should I add a system message at inference to exclude questions outside Scenarios 1 & 2, or do the Scenario 1 & 2 examples in fine-tuning alone take care of this?

  3. In the above case, what proportion of invalid-question examples should be included in the training data in addition to the valid questions? I have tried 20% invalid question-response examples in the training data, but at inference time many of the answers are those for invalid questions even when the user question is valid.


Request reply on the question.

Sorry, I’m having a hard time getting your approach. Can you be more specific and explain what you’re doing exactly, with some examples, the final goals, etc.? That would drastically help me get the context and maybe help you.

When we fine-tune a model by providing the training data with system, user and assistant messages, is it required to provide the system message once again at the inference time?

Yes.

It depends on your data, the model, and the type of fine-tuning.

  • Were all the system prompts the same? If your data is small, try to use as few distinct system prompts as possible.
  • How many examples did you have for each scenario? For each system prompt, include at least 5 distinct examples on different topics/cases (a quick check for both points is sketched after this list).
  • If it’s about scenarios, first check the loss from SFT; if it is less than 0.5, go for DPO. DPO will give you the conditional results you are looking for, while SFT is useful when the loss is higher.
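
As a quick check for the first two points, something like the following counts the distinct system prompts in a JSONL training file and the number of examples under each one. This is only a sketch; "train.jsonl" is a placeholder name for your fine-tuning file.

import json
from collections import Counter

# Count how many training examples use each distinct system prompt.
counts = Counter()
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        messages = json.loads(line)["messages"]
        system = next((m["content"] for m in messages if m["role"] == "system"), "<none>")
        counts[system] += 1

for system, n in counts.most_common():
    print(f"{n:5d} examples | system prompt: {system[:80]}")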

The safest way for fine-tuning is to use the model’s own outputs, otherwise it will go outside of your control.

Whether the model respects your defined scenarios depends on several factors. The most useful data for debugging would be the relationship between training and validation loss; attaching that here would show how the model reacts to your data.

Thanks for the response. I am using an LLM for assessing a student. I want to conduct 2 exams (Librarian, Segmentation); these are the two scenarios I mentioned. I have kept it to single turn and in the future will have to enhance this to multi-turn. The two scenarios together have 6373 JSON lines.

Fine-tuning Trial 1 - Librarian (108 JSON lines) + Segmentation (6264 JSON lines), training loss: 0.2500, all system and user prompts the same

When a candidate comes for the Librarian exam, he/she is shown the title and abstract of a piece of literature, and based on this the examinee has to identify which subject domain it belongs to. I have titles and abstracts from Computer Science, Physics, Mathematics, Statistics and 2 other domains. The examinee has to identify the domain to which the literature belongs.

{"messages": [{ "role": "system", "content": "The Exam is for librarian and the literature belong to the domain of Quantitative Finance" },
{ "role": "user", "content": "Which literature needs classification?" },
{ "role": "assistant", "content": "The Title is Expropriations… and the Abstract is Using the Panama Papers… "}]}

There were 18 examples for each domain (Computer Science, Physics, Mathematics, Statistics...). Altogether, across all domains, it was 108 JSON lines.

When a candidate comes for the segmentation test, he/she is shown details of a person such as age, marital status, profession, experience and others. Based on these details, the examinee has to identify the segment (A, B, C or D) to which the person can be mapped.

{"messages": [{ "role": "system", "content": "Segmentation is B and the Exam is for testing Persona categorisation"},
{ "role": "user", "content": "Which Persona needs classification?" },
{ "role": "assistant", "content": "Gender is Male, Ever Married is Yes, Age is 37, Graduated, Profession is Executive has Work Experience of 0 Year, Spending Score is Average, Family Size is 3" }]}

In the system message during inference:
(i) It was told to consider queries related to the librarian/segmentation exam and nothing else. It still accepted queries related to the other exam and also general questions. It ignored the system message content.
(ii) It was told to consider queries related to the librarian exam only and was shown how to respond if the query belonged to segmentation or generic queries. It still ignored the system message and considered all the queries.

Fine-tuning Trial 2 - Librarian (30 JSON lines) + Segmentation (46 JSON lines), training loss: 0.2734, all system and user prompts the same

Within the fine-tuning data, an example of how to respond to an unrelated query was included for each of the domains.

Librarian exam: the user asks about a persona and the assistant responds that it is irrelevant.
{"messages": [{ "role": "system", "content": "The Exam is for librarian and the literature belong to the domain of Computer Science" },
{ "role": "user", "content": "Which Persona needs classification?" },
{ "role": "assistant", "content": "Irrelevant request" }]}

Segmentation exam: the user asks about literature and the assistant responds that it is irrelevant.

{"messages": [{ "role": "system", "content": "Segmentation is C and the Exam is for testing Persona categorisation" },
{ "role": "user", "content": "Which literature needs classification?" },
{ "role": "assistant", "content": "Irrelevant request"}]}

A few general questions were also added with the corresponding assistant message as an irrelevant response. 11 JSON lines out of the total 76 were examples of irrelevant responses.

(i) The system became very restrictive and for any valid question it answered as irrelevant.

Fine-tuning Trial 3 - Librarian (60 JSON lines) + Segmentation (40 JSON lines), training loss: 1.3905, all system and user prompts the same

The fine-tuning system message itself was made restrictive and instructed the model to focus on the related exam only.

{"messages": [{ "role": "system", "content": "You are a helpful assistant focused on answering questions related to the librarian exam, specifically in Computer Science. Keep responses relevant, accurate, and avoid topics outside the exam scope." },
{ "role": "user", "content": "Which literature needs classification?" },
{ "role": "assistant", "content": "The Title is Contextual Regression: An Accurate … and the Abstract is Machine learning algorithms…"}]}

{"messages": [{ "role": "system", "content": "You are a helpful assistant focused on answering questions related to the persona segmentation, specifically from Segment D. Keep responses relevant, accurate, and avoid topics outside the exam scope." },
{ "role": "user", "content": "Which Persona needs classification?" },
{ "role": "assistant", "content": "Gender is Female, Ever Married is No, Age is 27, No Graduation, Profession is Homemaker has Work Experience of 8 Year, Spending Score is Low, Family Size is 1"}]}

I was still not able to confine it to the exam it was set to.

I should be able to

a. Set the LLM to the kind of exam it is conducting, so that it responds to relevant exam queries only.
b. In addition to the separation between the exams, it also has to exclude queries beyond the two exams.
c. In future, this has to be expanded to multi-turn, evaluating the examinee's answer: the domain to which the literature belongs in the librarian exam, and the segment to which a persona maps in the segmentation exam.

Thanks once again.

Thank you for the response. I have provided the details below.


Since the loss is less than 0.5, it means the model can understand your task, but accuracy is not guaranteed; this is why you might need DPO.

I'll add some general rules, based on your data and the context you have.

System message: prompt the model on how it should behave and what should or should not be done. Categorize each scenario or condition you have. If you can keep the system prompt the same, that would be helpful; if not, only change the last words (given the distinct elements of each scenario).
User message: provide the context and ask the model, based on that context and the other rules and preferences you have, how it should make the assessment.
Assistant message: the output you expect from the model. This part might be a little sensitive, so try to use synthetic data from the same model with few changes.

  • Create a good balance of scenarios. Balance and diversity will help the model generalize better.

  • If you use JSON as output, you can pass it as a function, or if it's not a function, precede the JSON with something like "Here is the assessment, given the JSON schema: ... {json details} ...". If you mean JSONL lines, then this refers to the 6k and 100 instance examples? I'm not sure whether you mean JSON output.

  • Convert your data into training, validation, and test sets and try SFT fine-tuning first.

  • After fine-tuning, test the model with the test set, then create a preference dataset based on correct and incorrect responses from the fine-tuned model (a sketch of one such preference pair follows this list).

  • Use the preference dataset to train and optimize the model toward what you want or don’t want.

  • Repeat the DPO process until you get the desired responses from the model.
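
As a rough sketch of one preference pair built from an observed failure (a literature-style question asked during the Segmentation exam), the field names below follow a common preferred/non-preferred layout for DPO-style preference data; verify the exact schema against the current OpenAI preference fine-tuning docs before using it, and treat "preference.jsonl" as a placeholder.

import json

# One preference pair built from a failure observed at test time: during the
# Segmentation exam the user asked a literature question, the model answered
# with a title/abstract anyway, and the preferred answer is the rejection.
pair = {
    "input": {
        "messages": [
            {"role": "system", "content": "Segmentation is C and the Exam is for testing Persona categorisation"},
            {"role": "user", "content": "Which literature needs classification?"},
        ]
    },
    "preferred_output": [{"role": "assistant", "content": "Irrelevant request"}],
    "non_preferred_output": [{"role": "assistant", "content": "The Title is Expropriations… and the Abstract is Using the Panama Papers…"}],
}

with open("preference.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(pair) + "\n")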

I will check this post once again and write a better response as soon as I can.


Thanks for the reply.

  1. I created a good balance between the scenarios so that both were equally represented. I didn't split the sample, as I actually don't know how the train-test split works in SFT. The one train-test split I did try had a very high validation loss of 1.43. I believe this is because of the generalisation capability of the text generator, which results in not matching any of the domain literature/persona present in the test sample. Please validate this understanding.

  2. I had very few examples of irrelevant responses for both scenarios.

    There were some wrong responses, and now I want to correct the model using DPO as mentioned. I have a scenario where:

a. The system message was that of the librarian exam.
The user query was "Give me a persona and I'll classify".

but the assistant gave me "The Title is Irreversible single-ion switches enable a variety of non-volatile electronic applications. However, there is…". 

b. The system message was that of the librarian exam.
The user query was a general-knowledge question: "Where was the last Olympics held".

but the assistant gave me "The Title is Improving Zero-shot Recognition via Semantic-visual Relation Graph with Graph Neural Networks and the Abstract is  Existing zero-shot…"

How do I set the preferred and non-preferred output in this case? Also, the assistant's answer on the literature above is not specific, as it can generalise. Finally, is there any elaborate literature on fine-tuning? I find the content available on the OpenAI site insufficient.

If the system prompts are diverse, then more of the model's parameters will change during fine-tuning (in most cases not a good idea). Because GPT-4o and GPT-4o mini are already fine-tuned, you want to keep that optimization, so you should only use one system prompt, or prompts with only small changes. Each possible output for a prompt will be a child node. The key point is that each child should be recognizable from its parent (in simpler terms: use the model's own output for the assistant messages).

It is language, and language is flexible; you can put it all in one system prompt. If you structure the fine-tuning data correctly, you increase the performance.

system prompt: [description of the classification, rules, preferences, instructions, etc] Do not put context here.

user prompt: context + a description of what the assistant should do.

Assistant: the clear, deterministic response you expect from the model (use synthetic data, the model's own output).

The more you alter the model through diversity of system prompts, the more illogical the model gets. Why? Because the model has been optimized before, and you are breaking this optimization by fine-tuning it again. So the key is to fine-tune it close to the base model (an already optimized model like GPT-4o or GPT-4o mini).

It works like a tree structure with parent and child nodes.
Each prompt has unlimited or limited child nodes. If the system prompts are diverse, there will be more parents, meaning more parameters of the model have to adapt to the fine-tuning data. If you limit the number of system prompts (1-2, depending on the dataset), then the model will quickly adapt to your dataset with fewer samples (even with five, you can achieve what you need).


  1. Training Set
  • The primary dataset used to train the model; the model learns patterns, features, and relationships from this data. It typically comprises 60-80% of the total available data. The training set is the conversation data used to fine-tune the model; this set changes the model's parameters and weights. Let's call it the fine-tuning dataset.
  2. Validation Set
  • Usually about 10-20% of the total data. It is not used to change the model's parameters and weights; it is used to evaluate the model's performance and generalization during the training (fine-tuning) process. It is unseen data that the model does not know about.

Each step of fine-tuning changes the model's parameters, so we use this set to see how the training data changes the model's parameters and weights.

  3. Test Set
  • Used for the final evaluation of the model's performance. After the model is trained/fine-tuned, we use this set to evaluate its performance; it is the same as an evaluation set or testing set. This is also 10-20% of the total data, and it can be part of the training set or not, depending on what you want to achieve.

Example of data splitting:

# For a dataset with 1000 samples
train_data = data[:600]      # 60% for training
val_data = data[600:800]     # 20% for validation
test_data = data[800:]       # 20% for testing

In most cases:

  • The test set should not be used during training, unless you deliberately want to see how the model performs on data it was trained on.
  • Data should be randomly shuffled before splitting into different sets.
  • The splits should maintain the same distribution of classes/categories/topics/scenarios as the original dataset (see the splitting sketch below).
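
Following on from the split above, here is a minimal sketch that shuffles before splitting and keeps the librarian/segmentation mix roughly equal in each set. The file names are placeholders, and it assumes the scenario can be recognized from the system prompt text.

import json
import random

random.seed(42)

# Load all examples, then split per scenario so each of train/val/test keeps
# roughly the same librarian/segmentation mix.
with open("all_examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

def scenario(example):
    # Assumption: the scenario is recognizable from the system prompt text.
    system = example["messages"][0]["content"].lower()
    return "librarian" if "librarian" in system else "segmentation"

by_scenario = {}
for ex in examples:
    by_scenario.setdefault(scenario(ex), []).append(ex)

train, val, test = [], [], []
for group in by_scenario.values():
    random.shuffle(group)                        # shuffle before splitting
    n = len(group)
    train += group[: int(0.6 * n)]               # 60% for training
    val += group[int(0.6 * n) : int(0.8 * n)]    # 20% for validation
    test += group[int(0.8 * n) :]                # 20% for testing

for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"{name}.jsonl", "w", encoding="utf-8") as out:
        for ex in split:
            out.write(json.dumps(ex) + "\n")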

Low and high validation loss:

High Validation Loss (e.g., > 1.0)

  • Indicates the model is struggling to generalize to unseen data
  • Common causes:
    • Model is overfitting to training data
    • Training and validation data are too different
    • System prompts are too diverse or inconsistent
    • Model is confused by contradictory examples
  • Example scenario:
    Training: Model learns specific librarian responses
    Validation: When seeing new questions, model gives very different responses
    Result: High loss because predictions don't match expected outputs well
    

Low Validation Loss (e.g., < 0.5)

  • Indicates good generalization to unseen data
  • Signs of healthy learning:
    • Model can handle new, similar prompts well
    • Responses are consistent with training examples
    • System prompts are being followed correctly
    • Model has learned underlying patterns, not just memorized
  • Example scenario:
    Training: Model learns librarian response patterns
    Validation: When seeing new questions, model gives similar-style responses
    Result: Low loss because predictions align well with expected outputs
    

example:

# Good Learning Curve
epoch 1: training_loss = 1.2, validation_loss = 1.1
epoch 5: training_loss = 0.6, validation_loss = 0.5
epoch 10: training_loss = 0.4, validation_loss = 0.4

# Bad Learning Curve
epoch 1: training_loss = 1.2, validation_loss = 1.1   # Model already knows the data
epoch 5: training_loss = 0.3, validation_loss = 1.4   # Validation gets worse, there is no generalization, going in wrong direction
epoch 10: training_loss = 0.1, validation_loss = 1.8  # Big gap indicates issues, difference between training and validation data

Key Points:

  • Aim for the validation loss to decrease along with the training loss; this indicates good generalization
  • A large gap between training and validation loss suggests problems (a small gap check is sketched below)
  • Consider reducing system prompt variety if the validation loss stays high
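
As a small illustration of the gap check, a sketch that flags checkpoints where the validation loss pulls away from the training loss. The curves are just the example numbers above, and the 0.3 threshold is an arbitrary assumption.

# Example curves from above (epoch -> (training_loss, validation_loss)).
# Replace these with the step metrics from your own fine-tuning job.
good_curve = {1: (1.2, 1.1), 5: (0.6, 0.5), 10: (0.4, 0.4)}
bad_curve = {1: (1.2, 1.1), 5: (0.3, 1.4), 10: (0.1, 1.8)}

def report(curve, max_gap=0.3):
    # max_gap is an arbitrary threshold; tune it to your task.
    for epoch, (train_loss, val_loss) in sorted(curve.items()):
        gap = val_loss - train_loss
        verdict = "ok" if gap <= max_gap else "large gap: possible overfitting or data mismatch"
        print(f"epoch {epoch:2d}: train={train_loss:.2f} val={val_loss:.2f} gap={gap:+.2f} -> {verdict}")

report(good_curve)
report(bad_curve)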

Thank you for the long response. Could you suggest some literature to learn more about fine-tuning, in addition to what is available in the OpenAI docs?

I was trying to find a resource, but I couldn't find anything useful or practical. The issue is how you structure your data.

System: Role: input classifier (as Librarian exam) Task: classify the given user's inputs into provided categories.

Prefix_Librarian = Identify the domain to which the literature belongs, given the title and abstract.
Suffix_Librarian = Rule: The domain to which the literature belongs should be one of the following: [Computer Science, Physics, Mathematics, Statistics, other domain, next domain].
Librarian_Context = Title: The Title is Expropriations… Abstract: Using the Panama Papers…

format_Librarian = Please identify the domain to which the literature belongs, given the title and abstract, and format your response [specify your format here]

User: Prefix_Librarian + “\n” + Librarian_Context + “\n” + Suffix_Librarian + format_Librarian

Assistant: The domain to which the literature belongs is [Computer Science]


System: Role: input classifier (as Segmentation exam) Task: Categorize personas/individuals into specific segments based on their characteristics, behaviors, or attributes.

Prefix_Segmentation = Given the details of the persona, identify the segment to which the persona belongs.
Suffix_Segmentation = Rule: The segment to which the persona belongs should be one of the following: [A, B, C, D].
Segmentation_Context = Gender is Male, Ever Married is Yes, Age is 37, Graduated, Profession is Executive has Work Experience of 0 Year, Spending Score is Average, Family Size is 3

format_Segmentation = Please identify the segment to which the persona belongs, start with the reasoning and place the result inside the [] for clarity.

User: Prefix_Segmentation + “\n” + Segmentation_Context + “\n” + Suffix_Segmentation + format_Segmentation

Assistant: Based on the characteristics provided, this is a relevant request and this persona belongs to the 'Urban Tech Professional' segment due to: age, occupation, location, and digital behavior patterns. So the segment is [B] // We provide reasoning before the segment classification; if there is no logical relationship between the persona and the segment, then there is no need to provide reasoning.

Example Structure:

{
    "messages": [
        {
            "role": "system",
            "content": "Categorize personas/individuals into specific segments based on their characteristics, behaviors, or attributes."
        },
        // The system message should be the same for all the scenarios (easier to manage); after fine-tuning you can use the same system message for inference, and it is important that it be the same message. This is the reason I mentioned that it is like a tree structure.
        {
            "role": "user",
            "content": "Given the details of the persona, identify the segment to which the given persona belong to.\n\nGender is Male, Ever Married is Yes, Age is 37, Graduated, Profession is Executive has Work Experience of 0 Year, Spending Score is Average, Family Size is 3\n\nRule: The segment to which the persona belong to should be one of the following: [A, B, C, D].\n\nFormat: Please identify the segment to which the persona belong to, start with reasoning and place the result inside the [] for clarity."
        },
        // we provided a clear descriptive detail in the user message
        {
            "role": "assistant",
            "content": "Based on the characteristics provided, this is a relevent request and this persona belongs to the 'Urban Tech Professional' segment due to: age, occupation, location, and digital behavior patterns. so the segment is [B]"
        } // Assistant provides reasoning before providing the segment classification. This most of the time is helppful, sometimes is not, depends on the context
    ]
}
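
To generate many lines in this shape consistently, here is a minimal sketch that assembles the prefix/context/suffix/format pieces defined above into the messages format. The output file name is a placeholder and the example answer is shortened.

import json

SYSTEM = ("Categorize personas/individuals into specific segments based on "
          "their characteristics, behaviors, or attributes.")
PREFIX = "Given the details of the persona, identify the segment to which the given persona belongs."
SUFFIX = "Rule: The segment to which the persona belongs should be one of the following: [A, B, C, D]."
FORMAT = ("Format: Please identify the segment to which the persona belongs, "
          "start with reasoning and place the result inside the [] for clarity.")

def build_example(context: str, assistant_answer: str) -> dict:
    # Keep the system message identical across all examples; only the user
    # context and the expected assistant answer change.
    user = f"{PREFIX}\n\n{context}\n\n{SUFFIX}\n\n{FORMAT}"
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant_answer},
    ]}

example = build_example(
    "Gender is Male, Ever Married is Yes, Age is 37, Graduated, Profession is Executive "
    "has Work Experience of 0 Year, Spending Score is Average, Family Size is 3",
    "Based on the characteristics provided, this is a relevant request, so the segment is [B]",
)
with open("segmentation_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")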

Issues:

  1. Model sometimes responds to literature queries during segmentation exam
    Solution: Provide context in user or system message
  2. Model might classify irrelevant inputs
    Solution: DPO optimization, if you know what you are doing, or change the structure of the input and output to add extra cues that indicate whether a request is relevant or irrelevant. As an example, check my assistant structure above.
  3. Inconsistent rejection of out-of-scope queries
    Solution: Same as previous

Here is an example of providing more context in the user or system message:

Valid Queries Should Include:
- Persona descriptions
- Individual characteristics
- Behavioral patterns
- Demographic information

Invalid Queries (Should Return "Irrelevant request"):
- General knowledge questions
- Technical support queries
- Any other non-segmentation-related questions

or:

Response Format:
   - For valid queries: Provide segment classification with reasoning
   - For invalid queries: Respond with "Irrelevant request"

Invalid Format:

User: "Can you classify this research paper about machine learning?"
Assistant: "Irrelevant request"

User: "What's the weather like today?"
Assistant: "Irrelevant request"

Fine-tuning LLMs is not as straightforward as input and output, and it is not something that you would implement by following guides. However, for classification (as in this case), it is simpler, as you can evaluate and iterate. You may get the results you want in a few iterations, but it is still not guaranteed. This process is true for any kind of model optimization.

Here is what I mean by tree structure:

Prompt: "The cat is"
(Showing top 3 probable tokens at each step with probabilities)

                                   "The"
                                     │
                                   "cat"
                                     │
                                   "is"
                                     │
                    ┌───────────────┼───────────────┐
                    │               │               │
              [0.4] "sleeping"  [0.3] "sitting" [0.2] "running"
                    │               │               │
         ┌──────────┼──────┐   ┌────┼────┐    ┌────┼────┐
         │          │      │   │         │    │         │
    [0.5] "on" [0.3] "in" │  [0.6] "on" │  [0.4] "in" │
         │          │      │   │         │    │         │
    ┌────┼────┐    │      │ ┌─┴─┐       │  ┌─┴─┐       │
    │         │    │      │ │   │       │  │   │       │
"the bed" "the mat" │      │"the"│       │"the" │       │
[0.4]    [0.3]     │      │[0.8]│       │[0.7] │       │
    │         │    │      │ │   │       │  │   │       │
"soundly" "quietly" │    "now" "mat"   "sofa" "park" "track"
[0.3]    [0.2]     │    [0.4][0.6]   [0.5] [0.4] [0.3]
    │         │    │      │  │        │    │     │
    ▼         ▼    ▼      ▼  ▼        ▼    ▼     ▼

Complete Sequences (with cumulative probabilities):
1. "The cat is sleeping on the bed soundly" (0.4 * 0.5 * 0.4 * 0.3 = 0.024)
2. "The cat is sitting on the mat" (0.3 * 0.6 * 0.8 * 0.6 = 0.0864)
3. "The cat is running in the park" (0.2 * 0.4 * 0.7 * 0.4 = 0.0224)

Legend:
┌─┼─┐   Branch point (token options)
│       Vertical connection (token dependency)
[p]     Probability of token given parent
(alt: ) Lower-probability alternatives at that step

Token Dependencies:
┌────────────────────────────────────────┐
│ Token Level │ Depends On               │
├────────────┼───────────────────────────┤
│ "The"      │ <START>                   │
│ "cat"      │ "The"                     │
│ "is"       │ "The cat"                 │
│ "sleeping" │ "The cat is"              │
│ "on"       │ "The cat is sleeping"     │
└────────────────────────────────────────┘
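
You can see this branching directly by asking the API for token log probabilities. A minimal sketch using the Chat Completions logprobs options follows; treat the exact response fields as something to double-check against the current API reference.

import math
from openai import OpenAI

client = OpenAI()

# Ask for the top alternatives at each generated token; this exposes the
# branching sketched in the tree above.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Complete this sentence: The cat is"}],
    max_tokens=8,
    logprobs=True,
    top_logprobs=3,
)

for step in response.choices[0].logprobs.content:
    alternatives = ", ".join(
        f"{alt.token!r} ({math.exp(alt.logprob):.2f})" for alt in step.top_logprobs
    )
    print(f"chose {step.token!r} | top alternatives: {alternatives}")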

I think you do not even need fine-tuning; you can get the result with prompting alone.

The problem with fine-tuning is that it's very difficult to decide which failures of the fine-tuned model to add to the next fine-tuning iteration, as the model behaves differently across two different inferences. Any advice on that?

If you have created the test set, then it should be easier to evaluate the model's responses. If you need to fine-tune a model to use it as a conversational model, then that is not recommended to start with.
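
To make that comparison more repeatable between runs, here is a sketch that pins temperature (and, best-effort, a seed), compares each test example against its reference answer with a simple exact match, and writes the mismatches out for the next SFT/DPO iteration. The model id and file names are placeholders, and exact match is only the simplest possible criterion.

import json
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini-2024-07-18:my-org::example"  # placeholder for your fine-tuned model id

failures = []
with open("test.jsonl", encoding="utf-8") as f:  # held-out test examples
    for line in f:
        if not line.strip():
            continue
        example = json.loads(line)
        prompt = [m for m in example["messages"] if m["role"] != "assistant"]
        expected = next(m["content"] for m in example["messages"] if m["role"] == "assistant")
        response = client.chat.completions.create(
            model=MODEL,
            messages=prompt,
            temperature=0,  # reduce run-to-run variation
            seed=7,         # best-effort reproducibility, not a hard guarantee
        )
        answer = response.choices[0].message.content
        if answer.strip() != expected.strip():  # exact match is the simplest criterion
            failures.append({"messages": prompt, "expected": expected, "got": answer})

# Keep the mismatches so the next SFT/DPO iteration is built from a stable list.
with open("failures.jsonl", "w", encoding="utf-8") as out:
    for item in failures:
        out.write(json.dumps(item) + "\n")

print(f"{len(failures)} failures collected for the next iteration")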

I checked your data last night, and I noticed that the main issue is how you structured your data. I recommend correcting the structure first.

I also think that you don’t need fine-tuning, and prompting should work.

Why does prompting not work for your case? Have you tried it?

I have done this as per the fine-tuning example presented in the OpenAI guide. Can you please point out the problems in the structure you mentioned? Below is the JSONL (one line from each scenario).

{"messages": [{ "role": "system", "content": "You are a specialized assistant focused solely on answering questions related to the librarian exam, specifically in Quantitative Finance. Do not engage in any conversations or provide information on topics outside this scope. All responses must be relevant to the librarian exam only " }, { "role": "user", "content": "Which literature needs classification?" }, { "role": "assistant", "content": "The Title is Identification of Conduit… Abstract is Due to economic globalization,… "}]}

{"messages": [{ "role": "system", "content": "You are a specialized assistant focused solely on answering questions related to persona segmentation, specifically from Segment A. Do not engage in any conversations or provide information on topics outside this scope. All responses must be relevant only to persona segmentation within Segment A." }, { "role": "user", "content": "Provide a persona to segment" }, { "role": "assistant", "content": "Gender is Female, Ever Married is No, Age is 35, Graduated, Profession is Engineer has Work Experience of 4 Year, Spending Score is Low, Family Size is 1"}]}

╔═══ Librarian Exam - Quantitative Finance ══╗

System: You are a specialized assistant focused solely on answering questions related to the librarian exam, specifically in Quantitative Finance. Do not engage in any conversations or provide information on topics outside this scope. All responses must be relevant to the librarian exam only.

User: Which literature needs classification?

Assistant: The Title is Identification of Conduit... Abstract is Due to economic globalization,...

╚══════════════════════════════╝

  • What I don't understand is that there is no context here; how should the assistant know that the title and abstract are as such?
  • You have given an input as a (system + user) message, and the assistant's task is to classify this input into different categories. If so, you should provide the context in the user message.

╔════ Persona Segmentation - Segment A ═════╗

System: You are a specialized assistant focused solely on answering questions related to persona segmentation, specifically from Segment A. Do not engage in any conversations or provide information on topics outside this scope. All responses must be relevant only to persona segmentation within Segment A.

User: Provide a persona to segment

Assistant: Gender: Female, Ever Married: No, Age: 35, Education: Graduated, Profession: Engineer, Work Experience: 4 Years, Spending Score: Low, Family Size: 1

╚═════════════════════════════════╝

  • Here you should expect that the assistant will generate random data rather than classifying something, same as in the previous example. I cannot see any context here.