Retraining of custom trained GPT 3.5 turbo model

  1. Custom training done for GPT 3.5 turbo model on set 1 data with 4 classes as labels. Getting 90% Accuracy test set.

  2. subsequently, retrained above custom trained model on set 2 with 6 new classes as labels. Getting 80% accuracy on test set from set 2.
    Whereas, 0% accuracy on set 1 test set.

  3. Why this new model is not incorporating learnings from set 1 and set 2 data ?

  4. Is there any way to have learnings from previous trained model?

In our case, training process will be ongoing as and when we received new data.

Hi @nahire - welcome to the Community.

Interesting question you are raising. I’m not 100% sure on this one but I suspect that when you fine-tune an existing model you wouldn’t be able to add new classes/labels.

I can say based on own experience that if you maintain the type of classes/labels and just add additional examples for each class/label when fine-tuning an existing fine-tuned model, it works well.

Perhaps others have first-hand experience.

That said, as this is an interesting point, I might run some tests tonight and can get back to you with additional observations then. In particular I wonder whether it would make a difference if you used a hybrid data set for fine-tuning that combines both examples with classes from the first data set and examples from the newly introduced classes.

1 Like

Hi again @nahire - I did end up running a quick but successful test with the “hybrid approach” using my dataset.

What I did was as follows:

  1. For the first fine-tuned model I used a dataset consisting of 60 text summaries, each classified into one of 3 topic labels (20 examples for each label). As expected, the model performed with 100% accuracy.

  2. I subsequently fine-tuned the newly created fine-tuned model with a second dataset consisting of 180 text summaries, now with each summary classified into one of 8 topic labels. For the 5 new topic labels I included 30 examples each, while for the existing 3 topic labels I included 10 additional examples each.

I then tested the final fine-tuned model and achieved approx. 95% accuracy in labeling (which is consistent with normal accuracy levels for my fine-tuned classification models as the number of labels increases).

One point to note is that in the training data I included a system message, which also included the labels to choose from. Across both datasets I kept the system message identical - the only difference was the list of labels. The syntax of the user prompt also remained the same across both datasets.

I hope this helps as an additional observation (even though it’s just a small dataset I tested it on).

P.S.: The obvious other choice you have to do classification are embeddings. So you may want to consider that as an option.

1 Like

Thanks @jr.2509 for you response. Great to hear about your successful test with the hybrid approach! I will also conduct trial with hybrid approach you have provided.

1 Like

Hi @jr.2509 ,

For the subsequent fine tuning after first fine tuned, I used the “Hybrid approach” for data preparation.

Where from Set 1 , I took 10 records each for 4 labels (40) and
from Set 2, I took 45 records each for New 4 labels ( 180).
So, I have total 220 training records.

But I am getting 25% accuracy on test set 1 from first data set and 50% accuracy on test set 2 from second data set. Whereas as I had 85% accuracy on test set 1 from first data set.

Can you please post the GPT code you have used for my reference?

Hi @nahire - before getting into the details, can I just ask if you fine-tuned a new model from scratch using the approach or did you use your existing fine-tuned model?

In any case, I’ll share with you shortly a disguised version of my system and user messages for reference.

Ok, so here’s the logic of my training data set for the test I’ve run:

Data set for initial version of the fine-tuned model

{“messages”: [
{“role”: “system”, “content”: “You are a helpful assistant. Your task is to classify the text by the user into one of the following pre-defined topic categories: Topic 1, Topic 2, Topic 3,”},
{“role”: “user”, “content”: “Text”},
{“role”: “assistant”, “content”: “Topic label”}]}

Composition of training data:
20 examples by topic


Data set for further fine-tuning the initial fine-tuned model

{“messages”: [
{“role”: “system”, “content”: “You are a helpful assistant. Your task is to classify the text by the user into one of the following pre-defined topic categories: Topic 1, Topic 2, Topic 3, Topic 4, Topic 5, Topic 6, Topic 7, Topic 8”},
{“role”: “user”, “content”: “Text”},
{“role”: “assistant”, “content”: “Topic label”}]}

Composition of training data:
10 examples for the pre-existing topic labels 1-3
30 examples for the new topic labels 4-8


So they are essentially identical. The only change under the second set of messages is the expanded list of topic labels.

Hi @jr.2509 ,
I used my existing fine-tuned model for new training with the “Hybrid Approach” of data.

Thanks for sharing the system and user messages for reference.
Let me add and try with system, user and assistant message.

Thanks for clarifying. For best test results I would suggest creating a fine-tuned model from scratch with this approach.

Sure @jr.2509 . Thanks for quick reply.

1 Like

Hi @jr.2509 ,Thanks. For your updates -
Hybrid Approach with system prompt message is working in my case also.

Below are details with Results:
Training 1: fine-tuned base model (gpt-3.5-turbo-1106) on Set 1 data of 4 lables (4 lables * 45 records each = 180 records)

Training 2 : further fine-tuning the initial fine-tuned model

Hybrid Approach for data preparation along with system prompt message as per reference you had provided previously.

from Set 1 Data , I took 40 records (4 * 10 records each = 40)
from Set 2 New Data , I took 180 records ( 4 new labels * 45 records each = 180)
So, New Set Data is : 40 + 180 = 220 records with 8 labels

Also, I have Added preprocessing on Input text as my input text is large in size and it is complex in nature.

Training 1 Accuracy : 85%
Training 2 Accuracy : Overall 68% ( Including Set 1 and Set 2 test data)
Set 1 : 65%
Set 2 : 75%

Note :
In my case I have to consider each label as one dictionary with 3 key value pairs.
So, input is large text and label is dictionary.
Still, I will be trying to improve this accuracy further.

1 Like

Thanks for the update @nahire - I’m glad to hear that it is going in the right direction and I hope that you will be able to improve further on accuracy.