Any tips to improve my fine-tuned model? (GPT annotation)

Hello, we are trying to build a multi-class classification model that classifies company reviews on Jobplanet, which is basically a Korean Glassdoor. I saw some recent research where people use GPT to annotate labels to reduce cost and time.

So we have done human annotation for 660 company reviews, and the labels are as follows:

1. Growth potential and vision: long-term growth potential of the company, business expansion, competitiveness in the industry, vision, future direction of the company, personal growth potential
2. Benefits and salary: salary levels, bonuses, salary increases, welfare benefits, health insurance, annual leave, and in-house training offered to employees
3. Work environment and WLB: working environment, work-life balance, work intensity, working hours, possibility to work from home, breaks, office facilities, location of workplace, mobility between organizations, fatigue
4. Company culture: company atmosphere, office politics (lines), working style, relationships between employees, communication style, collaboration style, free annual leave, horizontal, vertical, reporting paperwork, retention (L-Mun) - system
5. Management (leadership): compensation/performance evaluation, personnel policy, decision-making style, management strategy, consideration for employees, recruitment of new employees, hierarchy/system, lack of promotion / appropriate timing (because management should organize the organization at the right time), employment retention (Elmuwon) - systems
6. Other: text not applicable to the above

I followed the OpenAI fine-tuning API and made a train/val/test split of 300/100/260, and created a JSONL file following the system/user/assistant message format. Below is the system message of the model.
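For context, a minimal sketch of how one training record in our JSONL could be structured (the system message is abbreviated, and the review text and label are made-up placeholders, not from our actual dataset):

```python
import json

# Hypothetical example of one fine-tuning record in the chat format.
# The system message is abbreviated; the review and label are placeholders.
system_msg = "You are an AI expert in company reviews. Classify the sentence into one of the six categories: ..."

record = {
    "messages": [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": "The salary is low but the office atmosphere is friendly."},
        {"role": "assistant", "content": "2. Benefits and salary"},
    ]
}

# Each record is written as a single line in train.jsonl
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```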

Below are the training and validation results of fine-tuning GPT-4o-mini.

The training and validation results seem alright, but when we test on the test dataset the results are really bad, as follows:

              precision    recall  f1-score   support

    accuracy                           0.08       260
   macro avg       0.06      0.06      0.05       260
weighted avg       0.09      0.08      0.08       260

[eval of the fine-tuned model]

              precision    recall  f1-score   support

    accuracy                           0.02       260
   macro avg       0.00      0.00      0.00       260
weighted avg       0.05      0.02      0.02       260

[eval of the baseline: GPT-4o mini]
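The reports above follow the layout of scikit-learn's classification_report; a minimal sketch of how such a report can be produced (the label lists below are placeholders):

```python
from sklearn.metrics import classification_report

# y_true: human-annotated gold labels, y_pred: labels returned by the model
# (placeholder values, one entry per test review)
y_true = ["1", "2", "4", "6", "3"]
y_pred = ["1", "3", "4", "6", "2"]

# zero_division=0 suppresses warnings when a class is never predicted
print(classification_report(y_true, y_pred, zero_division=0))
```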

The good thing is that the fine-tuned model was better than the baseline model (GPT-4o mini). However, we expect the fine-tuned model to reach at least 0.7 accuracy before we can use it as an annotator.

We tried to analyze the problems, and the following is what we came up with:
(main) The multi-class classification task is challenging and vague; some classes overlap, and the standards used for human annotation were also vague. The human annotation guidelines should be re-established.

(main) We have only tried GPT-4o-mini and could try GPT-3.5, GPT-4o, etc., as well as different hyperparameters.

(sub) The fine-tuning text contains both English and Korean, which may make it harder for the model to understand and learn.

So, our main questions are:

  1. Should we do human annotation again?
  2. Is it better to make the system message simpler?
  3. Should we try fine-tuning GPT-3.5-turbo or GPT-4o instead, and maybe change hyperparameters (learning rate, epochs, etc.)? (See the sketch after this list.)
  4. Is it better to use a single language for the system, user, and assistant messages?
  5. Any tips or comments would be appreciated!
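For reference on question 3, a minimal sketch of how hyperparameters such as epochs and learning rate can be passed when creating a fine-tuning job with the OpenAI Python SDK; the file names, model snapshot, and values below are placeholders, not the settings we actually used:

```python
from openai import OpenAI

client = OpenAI()

# Upload the training and validation JSONL files
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("val.jsonl", "rb"), purpose="fine-tune")

# Create the fine-tuning job; n_epochs, learning_rate_multiplier and batch_size
# are the tunable hyperparameters (the values below are illustrative only)
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 1.0,
        "batch_size": 4,
    },
)
print(job.id)
```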

Thank you for your help!


Welcome to the Forum!

Some immediate reactions:

Generally speaking, the quality of your training data set is quite a critical success factor. So if you find it did not meet expectations, then this likely has a material bearing on the quality of your fine-tuned model.

That said, would you be able to share your full system prompt as well? You’ve shared the labels, but I’d like to take a look at the rest of the instructions to see if there’s anything that could contribute to the issue. Additionally, it would be helpful if you could share an example of the exact output you are expecting the model to return.

Thanks!

Thank you for the reply!
Below is the screenshot of the prompt. The Korean part is where the label descriptions are given.

Also, we figured out that our task was quite difficult since it is multi-class classification. We looked at some outputs and saw some improvements, so instead we now look at total accuracy (correct cells / all cells), and the results seem reasonable.

For the baseline (GPT-4o mini) the results were:
precision recall f1-score support

       0       0.82      0.95      0.88      1192
       1       0.68      0.34      0.45       368

accuracy                           0.81      1560

   macro avg       0.75      0.64      0.67      1560
weighted avg       0.79      0.81      0.78      1560

And the fine-tuned model's results were:
precision recall f1-score support

       0       0.95      0.94      0.95      1192
       1       0.82      0.85      0.83       368

accuracy                           0.92      1560

   macro avg       0.88      0.90      0.89      1560
weighted avg       0.92      0.92      0.92      1560
which shows improvement.

Thank you for sharing the additional info.

I think your system prompt is generally fine. I would likely consolidate it a bit further to avoid repetitions in instructions. Here’s one option for a refined version.

Refined system message:

You are an AI expert in company reviews. You are provided with an unlabeled sentence and required to classify it into one or multiple of the following six pre-defined categories: [Placeholder for category descriptions]. Your response must consist of the category label(s) only, strictly using the defined category terms. In case of multiple labels, separate them with commas.
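For illustration, a minimal sketch of how this system message could be used when calling the fine-tuned model as an annotator (the model ID is a placeholder for your actual fine-tuned model):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_MESSAGE = (
    "You are an AI expert in company reviews. You are provided with an unlabeled "
    "sentence and required to classify it into one or multiple of the following six "
    "pre-defined categories: [Placeholder for category descriptions]. Your response "
    "must consist of the category label(s) only, strictly using the defined category "
    "terms. In case of multiple labels, separate them with commas."
)

def classify_review(sentence: str) -> str:
    # "ft:gpt-4o-mini-..." is a placeholder for your actual fine-tuned model ID
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:your-org::abc123",
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": sentence},
        ],
        temperature=0,  # deterministic output for annotation
    )
    return response.choices[0].message.content.strip()
```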

Based on my own experience with fine-tuning for multi-classification, I’d say that besides the quality of the training data set, the composition of the training data is relatively important. While I think the volume of 660 examples should work well given the number of labels, you want to ensure that there is sufficient balance and avoid an overrepresentation of any labels. Also, in case there is a high variety in the writing style of the input sentences, you want to make sure that this diversity, too, is reflected in your data set.
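A quick way to check that balance is to count the labels in your training set, e.g. with a sketch like this (the data structure and values are illustrative):

```python
from collections import Counter

# train_examples: list of (review_text, label) pairs from the annotated data
# (placeholder values shown)
train_examples = [
    ("The pay is below industry average.", "2. Benefits and salary"),
    ("Very friendly atmosphere between teams.", "4. Company culture"),
]

label_counts = Counter(label for _, label in train_examples)
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / len(train_examples):.1%})")
```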

The other point worth noting is that you want to have a closer look at those examples that are inaccurately labelled by the fine-tuned model. Based on the insights, you can consider expanding your training set with more edge cases that specifically target these inaccuracies.
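One simple way to surface those examples is to collect the cases where the predicted label differs from the gold label and count the most frequent confusions; a minimal sketch, assuming the test sentences, gold labels, and predictions are kept as parallel lists (placeholder values shown):

```python
from collections import Counter

# Parallel lists from the evaluation step (placeholder values shown)
texts = ["Good salary but long hours", "Great vision from leadership"]
y_true = ["2", "1"]
y_pred = ["3", "1"]

# Collect the examples the fine-tuned model got wrong
misclassified = [
    (text, gold, pred)
    for text, gold, pred in zip(texts, y_true, y_pred)
    if gold != pred
]

# Count the most frequent (gold, predicted) confusions to decide
# which edge cases to add to the training set
confusion_pairs = Counter((gold, pred) for _, gold, pred in misclassified)
print(confusion_pairs.most_common(10))
```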

Hi @jaehyoyi1!

This is really interesting work! I’d like to ask how you evaluated your model on the test data and obtained the detailed metrics. Did you use the OpenAI evals framework for this, or the OpenAI Playground?

Thank you in advance!