Fine-Tuning Classifiers: Poor Performance

I trained a binary classifier with the new GPT-3.5-turbo fine-tuning API. Attached is an image of its performance.

After testing the model, it does adhere to the schema (1 or 2), but its predictions seem erratic and accuracy is poor. I used ~700 quality samples.

I expect classifiers to require more samples than usual. Has anyone tested this? Any ideas on what drives improvements? Any hyperparameter suggestions?


Consider that gpt-3.5-turbo comes with a lot of baggage: it has already been trained to chat, to deny and apologize, and to follow tons of different prompt styles and conversations.

For a non-chat, single-turn application, babbage-002 or davinci-002 would likely be far better. They take simple input/output pairs as fine-tuning data.
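If it helps, the base models use the completions-style JSONL training format: one prompt/completion pair per line. The `###` separator and the exact label tokens below are illustrative choices on my part, not requirements:

```
{"prompt": "Argument 1: `...` Argument 2: <...>\n\n###\n\n", "completion": " 1"}
{"prompt": "Argument 1: `...` Argument 2: <...>\n\n###\n\n", "completion": " 2"}
```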


Hmmm. Good point. One attractive thing about gpt-3.5-turbo is its higher performance ceiling compared to the GPT-3 base models.

Is this a “there aren’t enough samples in a JSONL file to unlearn 3.5’s instruction tuning” problem? Or is it more of an “I just need to quadruple my training set” problem?


What does your binary classifier prompt look like (if any)?

Why choose 1 and 2 as the reply tokens for a binary result from a basically instruction-type model, instead of the more common choice of 0 and 1?

What are your temperature and top_p parameters?


I’ve tested different temperature and top_p values on the trained model; again, it adheres to my requested schema, but classification accuracy is poor. (Or are you talking about changing/tuning temperature/top_p during training?)

Good point on the (0,1) instead of (1,2). I’ll retrain with that.
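Side note on making the schema airtight at inference time: you can pin the output to exactly the two label tokens. A minimal sketch, assuming the v1 openai Python SDK and tiktoken; the fine-tune ID is a placeholder:

```python
# Rough sketch: constrain a fine-tuned gpt-3.5-turbo classifier to answer
# with a single "0" or "1" token, so only accuracy is left to worry about.
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by gpt-3.5-turbo
# A +100 logit bias effectively limits sampling to these two tokens.
bias = {str(enc.encode(label)[0]): 100 for label in ("0", "1")}

prompt = "..."  # the comparison prompt shown below

resp = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:acme::placeholder",  # hypothetical fine-tune ID
    messages=[{"role": "user", "content": prompt}],
    temperature=0,   # always take the most likely label
    max_tokens=1,    # one token is enough for a 0/1 answer
    logit_bias=bias, # restrict the answer to the two label tokens
)
print(resp.choices[0].message.content)  # "0" or "1"
```

With temperature at 0 and the label tokens forced, whatever error remains is the classifier itself rather than sampling noise.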


f""" Below are two arguments.
The first is delimited by backticks ` .
The second is delimited by angle brackets <> .
State the number of the one you think is best written.

Argument 1: 

Argument 2:

Of course, binary classification like this is a nontrivial problem. GPT-4 seems to perform excellently on it, however. Curious to hear your thoughts!

I see.

Personally, if I find myself having to decide on the better of two, I don’t consider it binary classification.

For me, binary classification is when you put a system in a “black and white” world, show it a thing, and ask a question: is this black? Y/N.

Here, “better” must be defined and evaluated before the model can answer “which is better”.

I would approach it in two steps (sketched in code below):

  1. Evaluate:
  • define criteria for what is ideal
  • score each sample against the criteria (e.g. 0-9)
  2. Compare:
  • sort the samples by score
  • pick the best one
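A minimal sketch of that two-step idea, assuming the v1 openai Python SDK; the rubric wording, the 0-9 scale, and the model name are my own illustration:

```python
# Hypothetical two-step approach: grade each argument 0-9 against a rubric,
# then compare the scores instead of asking "which is better" directly.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Rate the following argument from 0 (poorly written) to 9 (excellently "
    "written), judging clarity, structure, and evidence. "
    "Reply with a single digit.\n\nArgument:\n{argument}"
)

def score(argument: str) -> int:
    """Step 1 (Evaluate): grade one argument against the rubric."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; could be a fine-tuned grader
        messages=[{"role": "user", "content": RUBRIC.format(argument=argument)}],
        temperature=0,  # deterministic grading
        max_tokens=1,   # a single digit is enough
    )
    return int(resp.choices[0].message.content)

def better_of(a: str, b: str) -> int:
    """Step 2 (Compare): return 1 if the first argument scores higher, else 2."""
    return 1 if score(a) >= score(b) else 2
```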

And all of the above comes before even getting into the parameter mess…

But then, I don’t have all the context and reasons to judge.