I trained a binary classifier with the new GPT-3.5-turbo fine-tuning API. Attached is an image of its performance.
After testing the model, it does adhere to the schema (1 or 2), but its performance seems erratic and poor. I used ~700 quality samples.
I'd expect a classifier to need more samples than usual. Has anyone tested this? Any ideas on where to look for improvements? Any hyperparameter suggestions?
Consider that gpt-3.5-turbo comes with a lot of baggage: it has already been trained to chat, to deny and apologize, and to follow tons of different prompt styles and conversations.
For a non-chat, single-turn application, babbage-002 or davinci-002 would likely work far better. They fine-tune on simple prompt-completion pairs.
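For reference, a training line in the legacy prompt-completion JSONL format for babbage-002 / davinci-002 might look like this sketch. The separator, wording, and " 0"/" 1" labels are illustrative assumptions, not the poster's actual data:

```python
import json

# One training line in the legacy prompt-completion JSONL format used by
# babbage-002 / davinci-002 fine-tuning. The prompt wording and the " 0"
# label are illustrative assumptions only.
example = {
    "prompt": "Argument A: `...`\nArgument B: <...>\n\nBetter:",
    "completion": " 0",  # single-token label, leading space per the old format
}

print(json.dumps(example))  # one line of the .jsonl training file
```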
Hmmm. Good point. One attractive thing about gpt-3.5-turbo is that its ceiling is higher than GPT-3's.
Is this a "there aren't enough samples you can fit in a JSONL file to unlearn 3.5's instruction tuning" problem? Or is it more of a "I might just have to quadruple my training set" problem?
What does your binary classifier prompt look like (if any)?
Why choose 1 and 2 as reply tokens for a binary result from a basically instruction-tuned model, instead of the more common choice of 0 and 1?
What are your temperature and top_p parameters?
I’ve tested different temperature and top_p values on the trained model — again, it adheres to my requested schema, but classification accuracy is poor. (Or are you talking about changing/tuning temperature/top_p during training?)
Good point on the (0,1) instead of (1,2). I’ll retrain with that.
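A sketch of what an inference request for the retrained classifier might look like. The model id is a placeholder, and the prompt wording is assumed; the point is that temperature 0 and max_tokens 1 pin the model to a single deterministic label token:

```python
# Sketch of an inference request for a fine-tuned 0/1 classifier.
# The model id below is a placeholder, not a real fine-tune name.
def build_request(arg_one: str, arg_two: str) -> dict:
    prompt = (
        "Below are two arguments.\n"
        f"The first is delimited by backticks: `{arg_one}`.\n"
        f"The second is delimited by angle brackets: <{arg_two}>.\n"
        "Reply 0 if the first is best written, 1 if the second is."
    )
    return {
        "model": "ft:gpt-3.5-turbo-0613:org::abc123",  # placeholder id
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding for a deterministic label
        "max_tokens": 1,   # force a single-token answer
    }

req = build_request("first text", "second text")
print(req["temperature"], req["max_tokens"])
```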
f"""Below are two arguments.
The first is delimited by backticks ` `.
The second is delimited by angle brackets < >.
State the number of the one you think is best written."""
Of course, binary classification like this is a nontrivial problem. GPT-4 seems to perform excellently, however. Curious to hear your thoughts!
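The prompt above could be packaged into a chat-format fine-tuning line for gpt-3.5-turbo like this. A sketch only — the helper name and the 0/1 label convention are assumptions, not the poster's actual pipeline:

```python
import json

# Sketch: one chat-format fine-tuning line for gpt-3.5-turbo, pairing the
# comparison prompt with a 0/1 assistant label. Details are assumptions.
def make_training_line(arg_one: str, arg_two: str, label: int) -> str:
    prompt = (
        "Below are two arguments.\n"
        f"The first is delimited by backticks: `{arg_one}`.\n"
        f"The second is delimited by angle brackets: <{arg_two}>.\n"
        "State the number of the one you think is best written."
    )
    return json.dumps({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": str(label)},
        ]
    })

print(make_training_line("argument one", "argument two", 0))
```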
Personally, if I have to decide which of two things is better, I don't consider that binary classification.
For me, binary classification is when you put a system in a "black and white" world, show it a thing, and ask a question: is this black? Y/N.
Here, "better" must be defined and evaluated before you can answer "which is better".
I would approach it in steps:
- define criteria of what is ideal
- score each sample against the criteria (e.g. 0-9)
- sort samples by score
- find the best
And all of the above before even getting into the parameter mess…
But then, I don’t have all the context and reasons to judge.
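The score-then-sort steps above can be sketched as follows. Here `score` is a caller-supplied function mapping a sample to a grade (in practice a wrapper around a model call); the length-based scorer in the usage line is purely a stand-in:

```python
# Sketch of the score-then-sort approach: grade each sample against the
# criteria, rank by grade, take the top one. `score` would normally wrap
# a model call returning a 0-9 grade; `len` below is just a stand-in.
def pick_best(samples, score):
    ranked = sorted(samples, key=score, reverse=True)
    return ranked[0]

# Toy usage with length as a stand-in quality score:
print(pick_best(["short", "a longer argument"], score=len))  # -> a longer argument
```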