Fine-tuning GPT-3 for email classification: seeking advice to improve accuracy

As I understand it (based on https://github.com/openai/openai-cookbook/blob/main/examples/Fine-tuned_classification.ipynb and OpenAI Platform):

  • a fine-tuned GPT-3 model is suitable for this task
  • I need a lot of training data
  • I should use the ada model as a base
  • choose completions that include only 1 token if possible

We use a ticket system for customer service inquiries, in which customer service employees currently assign a category to each email (e.g. delivery date request, product A consultation, product B complaint, etc.). From this system, I extracted about 5,000 emails (the customer's initial email only, without our responses or further correspondence) together with their categories.

The training data was created in this format:

{"prompt": "<Subject: … Body: … >\n\n###\n\n", "completion": " "}
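
For reference, a minimal sketch of how such a JSONL file can be built from an export, assuming a hypothetical emails.csv with subject, body and category columns (the file name and column names are placeholders):

```python
# Minimal sketch: build a fine-tuning JSONL file from an exported CSV.
# Assumes a hypothetical emails.csv with "subject", "body" and "category" columns.
import csv
import json

SEPARATOR = "\n\n###\n\n"  # same prompt/completion separator as above

with open("emails.csv", newline="", encoding="utf-8") as src, \
        open("train.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        prompt = f"Subject: {row['subject']}\nBody: {row['body']}{SEPARATOR}"
        # Leading space before the completion, as recommended in the cookbook.
        completion = " " + row["category"].strip()
        dst.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```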

The 5,000 emails fall into a total of 23 categories. The fine-tuning completed successfully, and I tested the model on a separate validation set of about 100 emails. However, only about 70% of these emails were assigned to the correct category.
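
A minimal sketch of how such an accuracy check can be run, assuming the legacy openai Python library (pre-1.0), a validation file valid.jsonl in the same prompt/completion format, and a placeholder fine-tuned model name:

```python
# Minimal sketch: measure classification accuracy on a validation JSONL file.
# Assumes the legacy openai Python library (<1.0) and a placeholder model name.
import json
import openai

MODEL = "ada:ft-your-org-2023-01-01"  # placeholder fine-tuned model name

correct = total = 0
with open("valid.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        resp = openai.Completion.create(
            model=MODEL,
            prompt=example["prompt"],
            max_tokens=10,   # enough tokens to cover a full category label
            temperature=0,   # deterministic output for evaluation
            stop=["\n"],
        )
        predicted = resp["choices"][0]["text"].strip()
        correct += predicted == example["completion"].strip()
        total += 1

print(f"accuracy: {correct / total:.2%}")
```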

How can I significantly increase this accuracy rate?
0. Is an accuracy rate of > 95% realistic?

  1. Should I shorten the emails in the training data, e.g. remove the signature or quoted email history? (Customers sometimes reply with their question to the order or shipping confirmation.) A rough preprocessing sketch follows this list.
  2. Should I apply the same preprocessing to the validation set emails before requesting completions?
  3. What other measures can I take to increase the accuracy rate?
  4. Are there any detailed case studies online for this use case (as it should be a common use case for classification)?
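
Regarding question 1, a rough sketch of the kind of preprocessing meant there, with made-up signature and quote markers that would need tuning to real emails:

```python
# Minimal sketch: strip signatures and quoted reply history before training/inference.
# The cut-off markers below are assumptions and would need tuning for real emails.
import re

SIGNATURE_MARKERS = [
    r"^--\s*$",          # conventional signature delimiter
    r"^Best regards",
    r"^Kind regards",
]
QUOTE_MARKERS = [
    r"^>",               # quoted lines
    r"^On .+ wrote:",    # reply header inserted by many clients
    r"^-{2,}\s*Original Message",
]

def clean_email(body: str) -> str:
    """Keep only the customer's own text above the first signature/quote marker."""
    kept = []
    for line in body.splitlines():
        if any(re.match(p, line.strip(), re.IGNORECASE)
               for p in SIGNATURE_MARKERS + QUOTE_MARKERS):
            break
        kept.append(line)
    return "\n".join(kept).strip()
```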

Thank you!

Hi,

Were you able to find a solution for the above requirement?

Don’t start with ada. Start with davinci. Then work your way down to curie, babbage, and finally ada and compare their performance as you go.

Davinci should give you the best accuracy because it understands the content of your emails best. The smaller models can save cost once you have enough examples to reach satisfactory accuracy, which makes them great for later iteration and optimization, but not for your MVP.

You might find a nice middle ground of quality, cost, and speed with curie.
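
As a rough sketch of that comparison, assuming the legacy openai Python library (pre-1.0) and a train.jsonl file prepared as above (the suffix is just a hypothetical naming convention):

```python
# Minimal sketch: fine-tune the same training file on several base models and
# compare their validation accuracy afterwards. Assumes the legacy openai library (<1.0).
import openai

# Upload the training file once; reuse its ID for every fine-tune.
training_file = openai.File.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

for base_model in ["davinci", "curie", "babbage", "ada"]:
    job = openai.FineTune.create(
        training_file=training_file["id"],
        model=base_model,
        suffix=f"email-classifier-{base_model}",  # hypothetical suffix for bookkeeping
    )
    print(base_model, job["id"])
```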

Also, try ignoring the single-token advice and using your full labels for your first fine-tune. Single-token completions are something you can do, not something you have to do for classification. The semantic meaning of your labels can help with smaller datasets (even though 5k emails is a pretty good starting size).
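
To illustrate the difference, a made-up example written once with a full-label completion and once with a single-token class ID (the category name and ID are invented):

```python
# Minimal sketch: the same training example with a full-label completion and with a
# single-token class ID. The category name and ID are made up for illustration.
full_label = {
    "prompt": "Subject: Where is my order?\nBody: ...\n\n###\n\n",
    "completion": " delivery date request",   # full label, several tokens
}

single_token = {
    "prompt": "Subject: Where is my order?\nBody: ...\n\n###\n\n",
    "completion": " 7",                       # numeric class ID, exactly one token
}
```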

Finally, I think you should have more than 100 validation examples for a dataset of 5,000 emails. Maybe 500 would be a better starting point.
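
A minimal sketch of such a split, assuming scikit-learn is available and the labeled emails are already loaded as parallel lists (the placeholders below stand in for the real data):

```python
# Minimal sketch: hold out roughly 10% of the data as a stratified validation set,
# so all 23 categories are represented in both splits.
from sklearn.model_selection import train_test_split

texts = [...]    # 5,000 email texts (placeholder)
labels = [...]   # the matching category labels, one per email (placeholder)

train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    texts,
    labels,
    test_size=0.1,      # ~500 validation examples out of 5,000
    stratify=labels,    # keep the category distribution in both splits
    random_state=42,
)
```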

I created a tool that can speed the whole process up, including the comparison of model performance, if you’re interested.