I trained a binary classifier with the new GPT-3.5-turbo fine-tuning API. Attached is an image of its performance.
After testing, the model does adhere to the schema (outputting 1 or 2), but its classification performance is erratic and poor. I used ~700 quality samples.
I expect classifiers to require more samples than usual. Has anyone tested this? Any sense of where the improvements come from? Any hyperparameter suggestions?
Consider that gpt-3.5-turbo comes with a lot of baggage: it has already been trained to chat, to refuse and apologize, and to follow tons of different prompt styles and conversation formats.
For a non-chat, single-turn application, babbage-002 or davinci-002 would likely be far better. They take a simple prompt/completion pair as fine-tuning input; see the sketch below.
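To make the contrast concrete, here is a minimal sketch of that training data, reusing the 1/2 labels from your post. The argument text is elided, and the `###` separator plus the leading space on the completion follow the legacy completions conventions:

```python
import json

# Sketch of prompt/completion rows for babbage-002 / davinci-002 fine-tuning.
# The argument text is elided; "###" marks the end of the prompt.
rows = [
    {"prompt": "Argument 1:\n`...`\nArgument 2:\n<...>\n\n###\n\n", "completion": " 1"},
    {"prompt": "Argument 1:\n`...`\nArgument 2:\n<...>\n\n###\n\n", "completion": " 2"},
]

with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```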
Hmmm. Good point. One attractive thing about GPT-3.5-turbo is its higher performance ceiling compared to the GPT-3 base models.
Is this a “there aren’t enough samples in a JSONL file to unlearn 3.5’s instruction tuning” problem? Or more of an “I just need to quadruple my training set” problem?
I’ve tested different temperature and top-p values on the trained model; again, it adheres to my requested schema, but classification accuracy is poor. (Or are you talking about changing temperature/top-p during training?)
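For reference, this is roughly how I’m querying the fine-tuned model at inference time, using the current openai Python SDK (v1+). The model id and prompt here are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "..."  # the comparison prompt (shared further down)

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0613:my-org::abc123",  # placeholder fine-tune id
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # greedy decoding, so sampling noise can't flip the label
    max_tokens=1,   # the answer is a single token: "1" or "2"
)
label = response.choices[0].message.content.strip()
```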
Good point on using (0,1) labels instead of (1,2). I’ll retrain with that.
Prompt (wrapped in a function so the two arguments fill distinct slots; the original snippet reused the same `{variable}` for both):

```python
def build_prompt(argument_1: str, argument_2: str) -> str:
    return f"""Below are two arguments.
The first is delimited by backticks ` .
The second is delimited by angle brackets <> .
State the number of the one you think is better written.

Argument 1:
`
{argument_1}
`

Argument 2:
<{argument_2}>
"""
```
Of course, binary classification like this is a nontrivial problem. GPT-4 seems to perform excellently, however. Curious to hear your thoughts!
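If anyone wants to quantify that comparison, here is a quick way to score any model on a held-out set. This is a sketch: `held_out` is a hypothetical list of (prompt, gold_label) pairs, and the model id passed in would be either "gpt-4" or the fine-tune id:

```python
from openai import OpenAI

client = OpenAI()

def accuracy(model: str, held_out: list[tuple[str, str]]) -> float:
    """Fraction of (prompt, gold_label) pairs the model answers correctly."""
    correct = 0
    for prompt, gold in held_out:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=1,
        )
        correct += resp.choices[0].message.content.strip() == gold
    return correct / len(held_out)

# e.g. compare accuracy("gpt-4", held_out) with the fine-tuned model's score
```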