As I understand it (based on https://github.com/openai/openai-cookbook/blob/main/examples/Fine-tuned_classification.ipynb and the OpenAI Platform documentation):
- a fine-tuned GPT-3 model is suitable for this task
- I need a lot of training data
- I should use the ada model as a base
- the completions should ideally consist of a single token each
We use a ticket system for customer service inquiries in which our customer service employees currently categorize each email (e.g. delivery date request, product A consultation, product B complaint, etc.). From this system I extracted about 5,000 emails (the customer's initial email only, without our responses or any further correspondence), each including its category.
The training data was created in this format:
{"prompt": "<Subject: … Body: … >\n\n###\n\n", "completion": " <category>"}
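For reference, here is roughly how I generate the JSONL training file. This is just a minimal sketch; the `(subject, body, category)` tuples and the output file name come from my export script and are assumptions, not part of the cookbook:

```python
import json

SEPARATOR = "\n\n###\n\n"  # prompt/completion separator recommended in the fine-tuning guide

def to_training_example(subject, body, category):
    # The completion is the category label with a leading space,
    # so it tokenizes consistently across examples.
    return {
        "prompt": f"<Subject: {subject} Body: {body} >{SEPARATOR}",
        "completion": f" {category}",
    }

def write_jsonl(emails, path="train.jsonl"):
    # emails: iterable of (subject, body, category) tuples from the ticket system export
    with open(path, "w", encoding="utf-8") as f:
        for subject, body, category in emails:
            f.write(json.dumps(to_training_example(subject, body, category), ensure_ascii=False) + "\n")
```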
The 5,000 emails fall into a total of 23 categories. The fine-tuning completed successfully, and I tested the model on a held-out validation set of about 100 emails. However, the model assigned the correct category only about 70% of the time.
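This is roughly how I score the validation set (a minimal sketch using the legacy `openai` Python library (< 1.0) that the cookbook is written against; the fine-tune ID below is a placeholder):

```python
import openai  # legacy library; reads OPENAI_API_KEY from the environment

MODEL = "ada:ft-mycompany-2023-01-01-00-00-00"  # placeholder fine-tune ID

def predict_category(prompt_text):
    # prompt_text must end with the same "\n\n###\n\n" separator as in training
    resp = openai.Completion.create(
        model=MODEL,
        prompt=prompt_text,
        max_tokens=1,   # categories are encoded as single tokens
        temperature=0,  # deterministic output for classification
    )
    return resp["choices"][0]["text"].strip()

def accuracy(validation):
    # validation: list of (prompt_text, true_label) pairs
    correct = sum(predict_category(p) == label for p, label in validation)
    return correct / len(validation)
```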
How can I significantly increase this accuracy rate?
0. Is an accuracy rate of > 95% realistic?
- Should I shorten the emails in the training data, e.g. remove the signature or quoted email history (customers sometimes reply with their questions to the order or shipping confirmation)? See the preprocessing sketch after this list.
- Should I apply the same preprocessing to the validation-set emails before sending them as completion requests?
- What other measures can I take to increase the accuracy rate?
- Are there any detailed case studies online for this (it should be a common classification use case)?
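Regarding shortening the emails, this is the kind of preprocessing I have in mind. A rough sketch only; the cut markers are guesses for our mail clients, not a complete list:

```python
import re

# Lines that mark the start of a signature or quoted history;
# these patterns are assumptions and would need tuning per mail client.
CUT_MARKERS = [
    r"^--\s*$",              # conventional signature delimiter
    r"^On .* wrote:$",       # English reply header
    r"^Am .* schrieb .*:$",  # German reply header
    r"^From: ",              # quoted/forwarded header (Outlook, English)
    r"^Von: ",               # quoted/forwarded header (Outlook, German)
]

def strip_noise(body: str) -> str:
    # Keep only the text above the first signature/quote marker.
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if any(re.match(pat, line.strip()) for pat in CUT_MARKERS):
            lines = lines[:i]
            break
    return "\n".join(lines).strip()
```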
Thank you!