Fine-tuning GPT-3 for email classification: seeking advice to improve accuracy

As I understand it (based on https://github.com/openai/openai-cookbook/blob/main/examples/Fine-tuned_classification.ipynb and OpenAI Platform):

  • a fine-tuned GPT-3 model is suitable for this task
  • I need a lot of training data
  • I should use the ada model as a base
  • choose completions that include only 1 token if possible

We use a ticket system for customer service inquiries, in which customer service employees currently assign a category to each email (e.g. delivery date request, product A consultation, product B complaint, etc.). From this system, I extracted about 5,000 emails (the customer's initial email only, without our responses or further correspondence) together with their categories.

The training data was created in this format:

{"prompt": "<Subject: … Body: … >\n\n###\n\n", "completion": " "}
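
For reference, a minimal sketch of how such a JSONL file can be built from an export, assuming a hypothetical emails.csv with subject, body and category columns (the file name and column names are placeholders):

```python
# Minimal sketch: build a fine-tuning JSONL file from an exported CSV.
# Assumes a hypothetical emails.csv with "subject", "body" and "category" columns.
import csv
import json

SEPARATOR = "\n\n###\n\n"  # same prompt/completion separator as above

with open("emails.csv", newline="", encoding="utf-8") as src, \
        open("train.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        prompt = f"Subject: {row['subject']}\nBody: {row['body']}{SEPARATOR}"
        # Leading space before the completion, as recommended in the cookbook.
        completion = " " + row["category"].strip()
        dst.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```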

The 5,000 emails fall into a total of 23 categories. The fine-tuning completed successfully, and I tested the model on a separate validation set of about 100 emails. However, only about 70% of these emails were assigned to the correct category.
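
A minimal sketch of how such an accuracy check can be run, assuming the legacy openai Python library (pre-1.0), a validation file valid.jsonl in the same prompt/completion format, and a placeholder fine-tuned model name:

```python
# Minimal sketch: measure classification accuracy on a validation JSONL file.
# Assumes the legacy openai Python library (<1.0) and a placeholder model name.
import json
import openai

MODEL = "ada:ft-your-org-2023-01-01"  # placeholder fine-tuned model name

correct = total = 0
with open("valid.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        resp = openai.Completion.create(
            model=MODEL,
            prompt=example["prompt"],
            max_tokens=10,   # enough tokens to cover a full category label
            temperature=0,   # deterministic output for evaluation
            stop=["\n"],
        )
        predicted = resp["choices"][0]["text"].strip()
        correct += predicted == example["completion"].strip()
        total += 1

print(f"accuracy: {correct / total:.2%}")
```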

How can I significantly increase this accuracy rate?
0. Is an accuracy rate of > 95% realistic?

  1. Should I shorten the emails in the training data, e.g. remove the signature or quoted email history? (Customers sometimes reply with their question to the order or shipping confirmation.) A rough preprocessing sketch follows this list.
  2. Should I apply the same preprocessing to the validation set emails before requesting completions?
  3. What other measures can I take to increase the accuracy rate?
  4. Are there any detailed case studies online for this use case (as it should be a common use case for classification)?
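
Regarding question 1, a rough sketch of the kind of preprocessing meant there, with made-up signature and quote markers that would need tuning to real emails:

```python
# Minimal sketch: strip signatures and quoted reply history before training/inference.
# The cut-off markers below are assumptions and would need tuning for real emails.
import re

SIGNATURE_MARKERS = [
    r"^--\s*$",          # conventional signature delimiter
    r"^Best regards",
    r"^Kind regards",
]
QUOTE_MARKERS = [
    r"^>",               # quoted lines
    r"^On .+ wrote:",    # reply header inserted by many clients
    r"^-{2,}\s*Original Message",
]

def clean_email(body: str) -> str:
    """Keep only the customer's own text above the first signature/quote marker."""
    kept = []
    for line in body.splitlines():
        if any(re.match(p, line.strip(), re.IGNORECASE)
               for p in SIGNATURE_MARKERS + QUOTE_MARKERS):
            break
        kept.append(line)
    return "\n".join(kept).strip()
```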

Thank you!

Hi,

Were you able to find a solution for the above requirement?

Don’t start with ada. Start with davinci. Then work your way down to curie, babbage, and finally ada and compare their performance as you go.

Davinci should give you the best accuracy because it understands the content of your emails best. The smaller models can save cost once you have enough examples to reach satisfactory accuracy, which makes them great for later iteration and optimization, but not for your MVP.

You might find a nice middle ground of quality, cost, and speed with curie.
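
As a rough sketch of that comparison, assuming the legacy openai Python library (pre-1.0) and a train.jsonl file prepared as above (the suffix is just a hypothetical naming convention):

```python
# Minimal sketch: fine-tune the same training file on several base models and
# compare their validation accuracy afterwards. Assumes the legacy openai library (<1.0).
import openai

# Upload the training file once; reuse its ID for every fine-tune.
training_file = openai.File.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

for base_model in ["davinci", "curie", "babbage", "ada"]:
    job = openai.FineTune.create(
        training_file=training_file["id"],
        model=base_model,
        suffix=f"email-classifier-{base_model}",  # hypothetical suffix for bookkeeping
    )
    print(base_model, job["id"])
```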

Also, try ignoring the single-token advice and using your full labels for your first fine-tune. Single-token completions are something you can do, not something you have to do for classification. The semantic meaning of your labels can help with smaller datasets (even though 5k emails is a pretty good starting size).
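
To illustrate the difference, a made-up example written once with a full-label completion and once with a single-token class ID (the category name and ID are invented):

```python
# Minimal sketch: the same training example with a full-label completion and with a
# single-token class ID. The category name and ID are made up for illustration.
full_label = {
    "prompt": "Subject: Where is my order?\nBody: ...\n\n###\n\n",
    "completion": " delivery date request",   # full label, several tokens
}

single_token = {
    "prompt": "Subject: Where is my order?\nBody: ...\n\n###\n\n",
    "completion": " 7",                       # numeric class ID, exactly one token
}
```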

Finally, I think you should have more than 100 validation examples for a dataset of 5,000 emails. Maybe 500 would be a better starting point.
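
A minimal sketch of such a split, assuming scikit-learn is available and the labeled emails are already loaded as parallel lists (the placeholders below stand in for the real data):

```python
# Minimal sketch: hold out roughly 10% of the data as a stratified validation set,
# so all 23 categories are represented in both splits.
from sklearn.model_selection import train_test_split

texts = [...]    # 5,000 email texts (placeholder)
labels = [...]   # the matching category labels, one per email (placeholder)

train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    texts,
    labels,
    test_size=0.1,      # ~500 validation examples out of 5,000
    stratify=labels,    # keep the category distribution in both splits
    random_state=42,
)
```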

I created a tool that can speed the whole process up, including the comparison of model performance, if you’re interested.