Multiple Scenario examples in fine-tuning

Hello,

Request help on the below.

  1. When we fine-tune a model by providing the training data with system, user and assistant messages, is it required to provide the system message once again at the inference time?

  2. There are examples for 2 scenarios in the fine-tuning data. I am including:

    1. Valid Scenario 1 examples + examples outside Scenario 1 and their corresponding responses.
    2. Valid Scenario 2 examples + examples outside Scenario 2 and their corresponding responses.

    In this case, at inference time the model has to select the scenario (1 or 2). Should I add a system message at inference to exclude questions outside Scenarios 1 & 2, or do the Scenario 1 & 2 examples in fine-tuning alone take care of this?

  3. In the above case, what proportion of invalid-question examples should be included in the training data in addition to the valid questions? I have tried 20% invalid question-response examples in the training data, but at inference time many of the answers are those for invalid questions even when the user question is valid.


Request reply on the question.

Sorry, I’m having a hard time getting your approach. Can you be more specific and explain what you’re doing exactly, with some examples, the final goals, etc.? That would drastically help me get the context and maybe help you.

When we fine-tune a model by providing the training data with system, user and assistant messages, is it required to provide the system message once again at the inference time?

Yes.

It depends on your data, the model, and the type of fine-tuning.

  • Were all the system prompts the same? If your data is small, try to use as few distinct system prompts as possible.
  • How many examples did you have for each scenario? For each system prompt, include at least 5 distinct examples on different topics/cases (a quick check for both points is sketched after this list).
  • If it’s about scenarios, first check the loss from SFT; if it is less than 0.5, go for DPO. DPO will give you the conditional results you are looking for, while SFT is useful when the loss is higher.
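
As a quick check for the first two points, something like the following counts the distinct system prompts in a JSONL training file and the number of examples under each one. This is only a sketch; "train.jsonl" is a placeholder name for your fine-tuning file.

import json
from collections import Counter

# Count how many training examples use each distinct system prompt.
counts = Counter()
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        messages = json.loads(line)["messages"]
        system = next((m["content"] for m in messages if m["role"] == "system"), "<none>")
        counts[system] += 1

for system, n in counts.most_common():
    print(f"{n:5d} examples | system prompt: {system[:80]}")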

The safest way for fine-tuning is to use the model’s own outputs, otherwise it will go outside of your control.

Whether the model respects your defined scenarios depends on several factors. The most useful data for debugging would be the relationship between training and validation loss; attaching that here would show how the model reacts to your data.

Thanks for the response. I am using an LLM for assessing a student. I want to conduct 2 exams (Librarian, Segmentation); these are the two scenarios I mentioned. I have kept it to single turn and in the future will have to enhance this to multi-turn. The two scenarios together have 6373 JSON lines.

Fine-tuning Trial 1 - Librarian (108 JSON lines) + Segmentation (6264 JSON lines), training loss: 0.2500, all system and user prompts the same

When a candidate comes for the Librarian exam, he/she is shown the title and abstract of a piece of literature, and based on this the examinee has to identify which subject domain it belongs to. I have titles and abstracts from Computer Science, Physics, Mathematics, Statistics and 2 other domains. The examinee has to identify the domain to which the literature belongs.

{"messages": [{ "role": "system", "content": "The Exam is for librarian and the literature belong to the domain of Quantitative Finance" },
{ "role": "user", "content": "Which literature needs classification?" },
{ "role": "assistant", "content": "The Title is Expropriations… and the Abstract is Using the Panama Papers… "}]}

There were 18 examples for each domain (Computer Science, Physics, Mathematics, Statistics...). Altogether, across all domains, it was 108 JSON lines.

When a candidate comes for the segmentation test, he/she is shown details of a person such as age, marital status, profession, experience and others. Based on these details, the examinee has to identify the segment (A, B, C or D) to which the person can be mapped.

{"messages": [{ "role": "system", "content": "Segmentation is B and the Exam is for testing Persona categorisation"},
{ "role": "user", "content": "Which Persona needs classification?" },
{ "role": "assistant", "content": "Gender is Male, Ever Married is Yes, Age is 37, Graduated, Profession is Executive has Work Experience of 0 Year, Spending Score is Average, Family Size is 3" }]}

In the system message during inference:
(i) It was told to consider queries related to the librarian/segmentation exam and nothing else. It still accepted queries related to the other exam and also general questions. It ignored the system message content.
(ii) It was told to consider queries related to the librarian exam only and was shown how to respond if the query belonged to segmentation or generic queries. It still ignored the system message and considered all the queries.

Fine-tuning Trial 2 - Librarian (30 JSON lines) + Segmentation (46 JSON lines), training loss: 0.2734, all system and user prompts the same

Within the fine-tuning data, an example of how to respond to an unrelated query was included for each of the domains.

Librarian exam: the user asks about a persona and the assistant responds that it is irrelevant.
{"messages": [{ "role": "system", "content": "The Exam is for librarian and the literature belong to the domain of Computer Science" },
{ "role": "user", "content": "Which Persona needs classification?" },
{ "role": "assistant", "content": "Irrelevant request" }]}

Segmentation exam: the user asks about literature and the assistant responds that it is irrelevant.

{"messages": [{ "role": "system", "content": "Segmentation is C and the Exam is for testing Persona categorisation" },
{ "role": "user", "content": "Which literature needs classification?" },
{ "role": "assistant", "content": "Irrelevant request"}]}

A few general questions were also added with the corresponding assistant message as an irrelevant response. 11 JSON lines out of the total 76 were examples of irrelevant responses.

(i) The system became very restrictive and for any valid question it answered as irrelevant.

Fine-tuning Trial 3 - Librarian (60 JSON lines) + Segmentation (40 JSON lines), training loss: 1.3905, all system and user prompts the same

The fine-tuning system message itself was made restrictive and instructed the model to focus on the related exam only.

{"messages": [{ "role": "system", "content": "You are a helpful assistant focused on answering questions related to the librarian exam, specifically in Computer Science. Keep responses relevant, accurate, and avoid topics outside the exam scope." },
{ "role": "user", "content": "Which literature needs classification?" },
{ "role": "assistant", "content": "The Title is Contextual Regression: An Accurate … and the Abstract is Machine learning algorithms…"}]}

{"messages": [{ "role": "system", "content": "You are a helpful assistant focused on answering questions related to the persona segmentation, specifically from Segment D. Keep responses relevant, accurate, and avoid topics outside the exam scope." },
{ "role": "user", "content": "Which Persona needs classification?" },
{ "role": "assistant", "content": "Gender is Female, Ever Married is No, Age is 27, No Graduation, Profession is Homemaker has Work Experience of 8 Year, Spending Score is Low, Family Size is 1"}]}

I was still not able to confine it to the exam it was set to.

I should be able to

a. Set the LLM to the kind of exam it is conducting, so that it responds to relevant exam queries only.
b. In addition to the separation between the exams, it also has to exclude queries beyond the two exams.
c. In future, this has to be expanded to multi-turn, evaluating the examinee's answer: the domain to which the literature belongs in the librarian exam, and the segment to which a persona maps in the segmentation exam.

Thanks once again.

Thank you for the response. I have provided the details below.


Since the loss is less than 0.5, it means the model can understand your task, but accuracy is not guaranteed; this is why you might need DPO.

I'll add some general rules, based on your data and the context you have.

System message: prompt the model on how it should behave and what should or should not be done. Categorize each scenario or condition you have. If you can keep the system prompt the same, that would be helpful; if not, only change the last words (given the distinct elements of each scenario).
User message: provide the context and ask the model, based on that context and the other rules and preferences you have, how it should make the assessment.
Assistant message: the output you expect from the model. This part might be a little sensitive, so try to use synthetic data from the same model with few changes.

  • Create a good balance of scenarios. Balance and diversity will help the model generalize better.

  • If you use JSON as output, you can pass it as a function, or if it's not a function, precede the JSON with something like "Here is the assessment, given the JSON schema: ... {json details} ...". If you mean JSONL lines, then this refers to the 6k and 100 instance examples? I'm not sure whether you mean JSON output.

  • Convert your data into training, validation, and test sets and try SFT fine-tuning first.

  • After fine-tuning, test the model with the test set, then create a preference dataset based on correct and incorrect responses from the fine-tuned model (a sketch of one such preference pair follows this list).

  • Use the preference dataset to train and optimize the model toward what you want or don’t want.

  • Repeat the DPO process until you get the desired responses from the model.
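
As a rough sketch of one preference pair built from an observed failure (a literature-style question asked during the Segmentation exam), the field names below follow a common preferred/non-preferred layout for DPO-style preference data; verify the exact schema against the current OpenAI preference fine-tuning docs before using it, and treat "preference.jsonl" as a placeholder.

import json

# One preference pair built from a failure observed at test time: during the
# Segmentation exam the user asked a literature question, the model answered
# with a title/abstract anyway, and the preferred answer is the rejection.
pair = {
    "input": {
        "messages": [
            {"role": "system", "content": "Segmentation is C and the Exam is for testing Persona categorisation"},
            {"role": "user", "content": "Which literature needs classification?"},
        ]
    },
    "preferred_output": [{"role": "assistant", "content": "Irrelevant request"}],
    "non_preferred_output": [{"role": "assistant", "content": "The Title is Expropriations… and the Abstract is Using the Panama Papers…"}],
}

with open("preference.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(pair) + "\n")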

I will check this post once again and write a better response as soon as I can.


Thanks for the reply.

  1. I created a good balance between the scenarios so that both were equally represented. I didn't split the sample, as I actually don't know how the train-test split works in SFT. The one train-test split I did try had a very high validation loss of 1.43. I believe this is because of the generalisation capability of the text generator, which results in not matching any of the domain literature/persona present in the test sample. Please validate this understanding.

  2. I had very few examples of irrelevant responses for both scenarios.

    There were some wrong responses, and now I want to correct the model using DPO as mentioned. I have a scenario where:

a. The system message was that of the librarian exam.
The user query was "Give me a persona and I'll classify".

but the assistant gave me "The Title is Irreversible single-ion switches enable a variety of non-volatile electronic applications. However, there is…". 

b. The system message was that of the librarian exam.
The user query was a general-knowledge question: "Where was the last Olympics held".

but the assistant gave me "The Title is Improving Zero-shot Recognition via Semantic-visual Relation Graph with Graph Neural Networks and the Abstract is  Existing zero-shot…"

How do I set the preferred and non-preferred output in this case? Also, the assistant's answer on the literature above is not specific, as it can generalise. Finally, is there any elaborate literature on fine-tuning? I find the content available on the OpenAI site insufficient.

If the system prompts are diverse, then more of the model's parameters will change during fine-tuning (in most cases not a good idea). Because GPT-4o and GPT-4o mini are already fine-tuned, you want to keep that optimization, so you should only use one system prompt, or prompts with only small changes. Each possible output for a prompt will be a child node. The key point is that each child should be recognizable from its parent (in simpler terms: use the model's own output for the assistant messages).

It is language, and language is flexible; you can put it all in one system prompt. If you structure the fine-tuning data correctly, you increase the performance.

system prompt: [description of the classification, rules, preferences, instructions, etc] Do not put context here.

user prompt: context + a description of what the assistant should do.

Assistant: the clear, deterministic response you expect from the model (use synthetic data, the model's own output).

The more you alter the model through diversity of system prompts, the more illogical the model gets. Why? Because the model has been optimized before, and you are breaking this optimization by fine-tuning it again. So the key is to fine-tune it close to the base model (an already optimized model like GPT-4o or GPT-4o mini).

It works like a tree structure with parent and child nodes.
Each prompt has unlimited or limited child nodes. If the system prompts are diverse, there will be more parents, meaning more parameters of the model have to adapt to the fine-tuning data. If you limit the number of system prompts (1-2, depending on the dataset), then the model will quickly adapt to your dataset with fewer samples (even with five, you can achieve what you need).


  1. Training Set
  • The primary dataset used to train the model; the model learns patterns, features, and relationships from this data. It typically comprises 60-80% of the total available data. The training set is the conversation data used to fine-tune the model; this set changes the model's parameters and weights. Let's call it the fine-tuning dataset.
  2. Validation Set
  • Usually about 10-20% of the total data. It is not used to change the model's parameters and weights; it is used to evaluate the model's performance and generalization during the training (fine-tuning) process. It is unseen data that the model does not know about.

Each step of fine-tuning changes the model's parameters, so we use this set to see how the training data changes the model's parameters and weights.

  3. Test Set
  • Used for the final evaluation of the model's performance. After the model is trained/fine-tuned, we use this set to evaluate its performance; it is the same as an evaluation set or testing set. This is also 10-20% of the total data, and it can be part of the training set or not, depending on what you want to achieve.

Example of data splitting:

# For a dataset with 1000 samples
train_data = data[:600]      # 60% for training
val_data = data[600:800]     # 20% for validation
test_data = data[800:]       # 20% for testing

In most cases:

  • The test set should not be used during training, unless you deliberately want to see how the model performs on data it was trained on.
  • Data should be randomly shuffled before splitting into different sets.
  • The splits should maintain the same distribution of classes/categories/topics/scenarios as the original dataset (see the splitting sketch below).
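
Following on from the split above, here is a minimal sketch that shuffles before splitting and keeps the librarian/segmentation mix roughly equal in each set. The file names are placeholders, and it assumes the scenario can be recognized from the system prompt text.

import json
import random

random.seed(42)

# Load all examples, then split per scenario so each of train/val/test keeps
# roughly the same librarian/segmentation mix.
with open("all_examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

def scenario(example):
    # Assumption: the scenario is recognizable from the system prompt text.
    system = example["messages"][0]["content"].lower()
    return "librarian" if "librarian" in system else "segmentation"

by_scenario = {}
for ex in examples:
    by_scenario.setdefault(scenario(ex), []).append(ex)

train, val, test = [], [], []
for group in by_scenario.values():
    random.shuffle(group)                        # shuffle before splitting
    n = len(group)
    train += group[: int(0.6 * n)]               # 60% for training
    val += group[int(0.6 * n) : int(0.8 * n)]    # 20% for validation
    test += group[int(0.8 * n) :]                # 20% for testing

for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"{name}.jsonl", "w", encoding="utf-8") as out:
        for ex in split:
            out.write(json.dumps(ex) + "\n")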

Low and high validation loss:

High Validation Loss (e.g., > 1.0)

  • Indicates the model is struggling to generalize to unseen data
  • Common causes:
    • Model is overfitting to training data
    • Training and validation data are too different
    • System prompts are too diverse or inconsistent
    • Model is confused by contradictory examples
  • Example scenario:
    Training: Model learns specific librarian responses
    Validation: When seeing new questions, model gives very different responses
    Result: High loss because predictions don't match expected outputs well
    

Low Validation Loss (e.g., < 0.5)

  • Indicates good generalization to unseen data
  • Signs of healthy learning:
    • Model can handle new, similar prompts well
    • Responses are consistent with training examples
    • System prompts are being followed correctly
    • Model has learned underlying patterns, not just memorized
  • Example scenario:
    Training: Model learns librarian response patterns
    Validation: When seeing new questions, model gives similar-style responses
    Result: Low loss because predictions align well with expected outputs
    

example:

# Good Learning Curve
epoch 1: training_loss = 1.2, validation_loss = 1.1
epoch 5: training_loss = 0.6, validation_loss = 0.5
epoch 10: training_loss = 0.4, validation_loss = 0.4

# Bad Learning Curve
epoch 1: training_loss = 1.2, validation_loss = 1.1   # Model already knows the data
epoch 5: training_loss = 0.3, validation_loss = 1.4   # Validation gets worse, there is no generalization, going in wrong direction
epoch 10: training_loss = 0.1, validation_loss = 1.8  # Big gap indicates issues, difference between training and validation data

Key Points:

  • Aim for the validation loss to decrease along with the training loss; this indicates good generalization
  • A large gap between training and validation loss suggests problems (a small gap check is sketched below)
  • Consider reducing system prompt variety if the validation loss stays high
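
As a small illustration of the gap check, a sketch that flags checkpoints where the validation loss pulls away from the training loss. The curves are just the example numbers above, and the 0.3 threshold is an arbitrary assumption.

# Example curves from above (epoch -> (training_loss, validation_loss)).
# Replace these with the step metrics from your own fine-tuning job.
good_curve = {1: (1.2, 1.1), 5: (0.6, 0.5), 10: (0.4, 0.4)}
bad_curve = {1: (1.2, 1.1), 5: (0.3, 1.4), 10: (0.1, 1.8)}

def report(curve, max_gap=0.3):
    # max_gap is an arbitrary threshold; tune it to your task.
    for epoch, (train_loss, val_loss) in sorted(curve.items()):
        gap = val_loss - train_loss
        verdict = "ok" if gap <= max_gap else "large gap: possible overfitting or data mismatch"
        print(f"epoch {epoch:2d}: train={train_loss:.2f} val={val_loss:.2f} gap={gap:+.2f} -> {verdict}")

report(good_curve)
report(bad_curve)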

Thank you for the long response. Could you suggest some literature to learn more about fine-tuning, in addition to what is available in the OpenAI docs?

I was trying to find a resource, but I couldn't find anything useful or practical. The issue is how you structure your data.

System: Role: input classifier (as Librarian exam) Task: classify the given user's inputs into provided categories.

Prefix_Librarian = Identify the domain to which the literature belongs, given the title and abstract.
Suffix_Librarian = Rule: The domain to which the literature belongs should be one of the following: [Computer Science, Physics, Mathematics, Statistics, other domain, next domain].
Librarian_Context = Title: The Title is Expropriations… Abstract: Using the Panama Papers…

format_Librarian = Please identify the domain to which the literature belongs, given the title and abstract, and format your response [specify your format here]

User: Prefix_Librarian + “\n” + Librarian_Context + “\n” + Suffix_Librarian + format_Librarian

Assistant: The domain to which the literature belongs is [Computer Science]


System: Role: input classifier (as Segmentation exam) Task: Categorize personas/individuals into specific segments based on their characteristics, behaviors, or attributes.

Prefix_Segmentation = Given the details of the persona, identify the segment to which the persona belongs.
Suffix_Segmentation = Rule: The segment to which the persona belongs should be one of the following: [A, B, C, D].
Segmentation_Context = Gender is Male, Ever Married is Yes, Age is 37, Graduated, Profession is Executive has Work Experience of 0 Year, Spending Score is Average, Family Size is 3

format_Segmentation = Please identify the segment to which the persona belongs, start with the reasoning and place the result inside the [] for clarity.

User: Prefix_Segmentation + “\n” + Segmentation_Context + “\n” + Suffix_Segmentation + format_Segmentation

Assistant: Based on the characteristics provided, this is a relevant request and this persona belongs to the 'Urban Tech Professional' segment due to: age, occupation, location, and digital behavior patterns. So the segment is [B] // We provide reasoning before the segment classification; if there is no logical relationship between the persona and the segment, then there is no need to provide reasoning.

Example Structure:

{
    "messages": [
        {
            "role": "system",
            "content": "Categorize personas/individuals into specific segments based on their characteristics, behaviors, or attributes."
        },
        // The system message should be the same for all the scenarios (easier to manage); after fine-tuning you can use the same system message for inference, and it is important that it be the same message. This is the reason I mentioned that it is like a tree structure.
        {
            "role": "user",
            "content": "Given the details of the persona, identify the segment to which the given persona belong to.\n\nGender is Male, Ever Married is Yes, Age is 37, Graduated, Profession is Executive has Work Experience of 0 Year, Spending Score is Average, Family Size is 3\n\nRule: The segment to which the persona belong to should be one of the following: [A, B, C, D].\n\nFormat: Please identify the segment to which the persona belong to, start with reasoning and place the result inside the [] for clarity."
        },
        // we provided a clear descriptive detail in the user message
        {
            "role": "assistant",
            "content": "Based on the characteristics provided, this is a relevent request and this persona belongs to the 'Urban Tech Professional' segment due to: age, occupation, location, and digital behavior patterns. so the segment is [B]"
        } // Assistant provides reasoning before providing the segment classification. This most of the time is helppful, sometimes is not, depends on the context
    ]
}
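
To generate many lines in this shape consistently, here is a minimal sketch that assembles the prefix/context/suffix/format pieces defined above into the messages format. The output file name is a placeholder and the example answer is shortened.

import json

SYSTEM = ("Categorize personas/individuals into specific segments based on "
          "their characteristics, behaviors, or attributes.")
PREFIX = "Given the details of the persona, identify the segment to which the given persona belongs."
SUFFIX = "Rule: The segment to which the persona belongs should be one of the following: [A, B, C, D]."
FORMAT = ("Format: Please identify the segment to which the persona belongs, "
          "start with reasoning and place the result inside the [] for clarity.")

def build_example(context: str, assistant_answer: str) -> dict:
    # Keep the system message identical across all examples; only the user
    # context and the expected assistant answer change.
    user = f"{PREFIX}\n\n{context}\n\n{SUFFIX}\n\n{FORMAT}"
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant_answer},
    ]}

example = build_example(
    "Gender is Male, Ever Married is Yes, Age is 37, Graduated, Profession is Executive "
    "has Work Experience of 0 Year, Spending Score is Average, Family Size is 3",
    "Based on the characteristics provided, this is a relevant request, so the segment is [B]",
)
with open("segmentation_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")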

Issues:

  1. Model sometimes responds to literature queries during segmentation exam
    Solution: Provide context in user or system message
  2. Model might classify irrelevant inputs
    Solution: DPO optimization, if you know what you are doing, or change the structure of the input and output to add extra cues that indicate whether a request is relevant or irrelevant. As an example, check my assistant structure above.
  3. Inconsistent rejection of out-of-scope queries
    Solution: Same as previous

Here is an example of providing more context in the user or system message:

Valid Queries Should Include:
- Persona descriptions
- Individual characteristics
- Behavioral patterns
- Demographic information

Invalid Queries (Should Return "Irrelevant request"):
- General knowledge questions
- Technical support queries
- Any other non-segmentation-related questions

or:

Response Format:
   - For valid queries: Provide segment classification with reasoning
   - For invalid queries: Respond with "Irrelevant request"

Invalid Format:

User: "Can you classify this research paper about machine learning?"
Assistant: "Irrelevant request"

User: "What's the weather like today?"
Assistant: "Irrelevant request"

Fine-tuning LLMs is not as straightforward as input and output, and it is not something that you would implement by following guides. However, for classification (as in this case), it is simpler, as you can evaluate and iterate. You may get the results you want in a few iterations, but it is still not guaranteed. This process is true for any kind of model optimization.

Here is what I mean by tree structure:

Prompt: "The cat is"
(Showing top 3 probable tokens at each step with probabilities)

                                   "The"
                                     │
                                   "cat"
                                     │
                                   "is"
                                     │
                    ┌───────────────┼───────────────┐
                    │               │               │
              [0.4] "sleeping"  [0.3] "sitting" [0.2] "running"
                    │               │               │
         ┌──────────┼──────┐   ┌────┼────┐    ┌────┼────┐
         │          │      │   │         │    │         │
    [0.5] "on" [0.3] "in" │  [0.6] "on" │  [0.4] "in" │
         │          │      │   │         │    │         │
    ┌────┼────┐    │      │ ┌─┴─┐       │  ┌─┴─┐       │
    │         │    │      │ │   │       │  │   │       │
"the bed" "the mat" │      │"the"│       │"the" │       │
[0.4]    [0.3]     │      │[0.8]│       │[0.7] │       │
    │         │    │      │ │   │       │  │   │       │
"soundly" "quietly" │    "now" "mat"   "sofa" "park" "track"
[0.3]    [0.2]     │    [0.4][0.6]   [0.5] [0.4] [0.3]
    │         │    │      │  │        │    │     │
    ▼         ▼    ▼      ▼  ▼        ▼    ▼     ▼

Complete Sequences (with cumulative probabilities):
1. "The cat is sleeping on the bed soundly" (0.4 * 0.5 * 0.4 * 0.3 = 0.024)
2. "The cat is sitting on the mat" (0.3 * 0.6 * 0.8 * 0.6 = 0.0864)
3. "The cat is running in the park" (0.2 * 0.4 * 0.7 * 0.4 = 0.0224)

Legend:
┌─┼─┐   Branch point (token options)
│       Vertical connection (token dependency)
[p]     Probability of token given parent
(alt: ) Lower-probability alternatives at that step

Token Dependencies:
┌────────────────────────────────────────┐
│ Token Level │ Depends On               │
├────────────┼───────────────────────────┤
│ "The"      │ <START>                   │
│ "cat"      │ "The"                     │
│ "is"       │ "The cat"                 │
│ "sleeping" │ "The cat is"              │
│ "on"       │ "The cat is sleeping"     │
└────────────────────────────────────────┘
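
You can see this branching directly by asking the API for token log probabilities. A minimal sketch using the Chat Completions logprobs options follows; treat the exact response fields as something to double-check against the current API reference.

import math
from openai import OpenAI

client = OpenAI()

# Ask for the top alternatives at each generated token; this exposes the
# branching sketched in the tree above.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Complete this sentence: The cat is"}],
    max_tokens=8,
    logprobs=True,
    top_logprobs=3,
)

for step in response.choices[0].logprobs.content:
    alternatives = ", ".join(
        f"{alt.token!r} ({math.exp(alt.logprob):.2f})" for alt in step.top_logprobs
    )
    print(f"chose {step.token!r} | top alternatives: {alternatives}")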

I think you do not even need fine-tuning; you can get the result with prompting alone.

The problem with fine-tuning is that it's very difficult to decide which failures of the fine-tuned model to add to the next fine-tuning iteration, as the model behaves differently across two different inferences. Any advice on that?

If you have created the test set, then it should be easier to evaluate the model's responses. If you need to fine-tune a model to use it as a conversational model, then that is not recommended to start with.
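
To make that comparison more repeatable between runs, here is a sketch that pins temperature (and, best-effort, a seed), compares each test example against its reference answer with a simple exact match, and writes the mismatches out for the next SFT/DPO iteration. The model id and file names are placeholders, and exact match is only the simplest possible criterion.

import json
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini-2024-07-18:my-org::example"  # placeholder for your fine-tuned model id

failures = []
with open("test.jsonl", encoding="utf-8") as f:  # held-out test examples
    for line in f:
        if not line.strip():
            continue
        example = json.loads(line)
        prompt = [m for m in example["messages"] if m["role"] != "assistant"]
        expected = next(m["content"] for m in example["messages"] if m["role"] == "assistant")
        response = client.chat.completions.create(
            model=MODEL,
            messages=prompt,
            temperature=0,  # reduce run-to-run variation
            seed=7,         # best-effort reproducibility, not a hard guarantee
        )
        answer = response.choices[0].message.content
        if answer.strip() != expected.strip():  # exact match is the simplest criterion
            failures.append({"messages": prompt, "expected": expected, "got": answer})

# Keep the mismatches so the next SFT/DPO iteration is built from a stable list.
with open("failures.jsonl", "w", encoding="utf-8") as out:
    for item in failures:
        out.write(json.dumps(item) + "\n")

print(f"{len(failures)} failures collected for the next iteration")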

I checked your data last night, and I noticed that the main issue is how you structured your data. I recommend correcting the structure first.

I also think that you don’t need fine-tuning, and prompting should work.

Why does prompting not work for your case? Have you tried it?

I have done this as per the fine-tuning example presented in the OpenAI guide. Can you please point out the problems in the structure you mentioned? Below is the JSONL (one line from each scenario).

{"messages": [{ "role": "system", "content": "You are a specialized assistant focused solely on answering questions related to the librarian exam, specifically in Quantitative Finance. Do not engage in any conversations or provide information on topics outside this scope. All responses must be relevant to the librarian exam only " }, { "role": "user", "content": "Which literature needs classification?" }, { "role": "assistant", "content": "The Title is Identification of Conduit… Abstract is Due to economic globalization,… "}]}

{"messages": [{ "role": "system", "content": "You are a specialized assistant focused solely on answering questions related to persona segmentation, specifically from Segment A. Do not engage in any conversations or provide information on topics outside this scope. All responses must be relevant only to persona segmentation within Segment A." }, { "role": "user", "content": "Provide a persona to segment" }, { "role": "assistant", "content": "Gender is Female, Ever Married is No, Age is 35, Graduated, Profession is Engineer has Work Experience of 4 Year, Spending Score is Low, Family Size is 1"}]}

╔═══ Librarian Exam - Quantitative Finance ══╗

System: You are a specialized assistant focused solely on answering questions related to the librarian exam, specifically in Quantitative Finance. Do not engage in any conversations or provide information on topics outside this scope. All responses must be relevant to the librarian exam only.

User: Which literature needs classification?

Assistant: The Title is Identification of Conduit... Abstract is Due to economic globalization,...

╚══════════════════════════════╝

  • What I don't understand is that there is no context here; how should the assistant know that the title and abstract are as such?
  • You have given an input as a (system + user) message, and the assistant's task is to classify this input into different categories. If so, you should provide the context in the user message.

╔════ Persona Segmentation - Segment A ═════╗

System: You are a specialized assistant focused solely on answering questions related to persona segmentation, specifically from Segment A. Do not engage in any conversations or provide information on topics outside this scope. All responses must be relevant only to persona segmentation within Segment A.

User: Provide a persona to segment

Assistant: Gender: Female, Ever Married: No, Age: 35, Education: Graduated, Profession: Engineer, Work Experience: 4 Years, Spending Score: Low, Family Size: 1

╚═════════════════════════════════╝

  • Here you should expect that the assistant will generate random data rather than classifying something, same as in the previous example. I cannot see any context here.