Thank you both for the advice on solving classification problems using currently available endpoints: finetune w/ single-token labels and then complete w/ logprobs=# classes and max_tokens=1.
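For reference, here's roughly what that recipe looks like (a minimal sketch, assuming the legacy pre-1.0 openai Python client; the fine-tuned model name and prompt format are placeholders):

```python
import openai  # legacy (pre-1.0) client interface

# Placeholders: a hypothetical fine-tuned model and a prompt in its training format.
FINE_TUNED_MODEL = "davinci:ft-your-org-2023-01-01-00-00-00"
prompt = "Review: I fell asleep halfway through.\nSentiment:"

resp = openai.Completion.create(
    model=FINE_TUNED_MODEL,
    prompt=prompt,
    max_tokens=1,   # each class was mapped to a single label token during fine-tuning
    temperature=0,  # deterministic: take the most likely label token
    logprobs=2,     # return top-k token log probs, k = number of classes
)

choice = resp["choices"][0]
predicted_label = choice["text"].strip()
class_logprobs = choice["logprobs"]["top_logprobs"][0]  # dict: token -> log prob
print(predicted_label, class_logprobs)
```

The returned log probs are over the whole vocabulary, so if calibrated class probabilities are needed they can be renormalized over just the label tokens.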
I’d like to pivot to discussing the theoretical merits of directly estimating Pr(my inputted completion | my inputted prompt) vs. what’s currently available: sampling a completion given my inputted prompt and returning Pr(GPT-3's outputted completion | my inputted prompt). Here’s a short comparison between estimation and completion:
- Both can be zero shot or finetuned using the exact same data and loss.
- Estimation directly yields the probabilities necessary for Bayes-optimal classification. Completion does not. Transforming each label to a single token should almost guarantee it for completion too, though that seems more effective when finetuning is feasible.
- Estimation does not require transforming labels to single tokens. This advantage could be significant b/c it lets GPT-3 exploit the label’s semantics. In the movie review example above, the label 'Just another superhero movie' is richer than any single token (see the quick tokenization check after this list). Maybe this turns out to be a drawback in practice, though; it’s hard to say.
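As a quick illustration of the multi-token point (a small check, assuming the GPT-3 era r50k_base encoding via the tiktoken package):

```python
import tiktoken  # assumes tiktoken is installed

enc = tiktoken.get_encoding("r50k_base")  # the GPT-3 era BPE vocabulary
label = "Just another superhero movie"
token_ids = enc.encode(label)
print(len(token_ids), [enc.decode([t]) for t in token_ids])
# The label spans several tokens, so the single-token completion recipe
# can't use it as-is; a direct estimation endpoint could score it whole.
```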
One obvious problem is that argmax Pr(my inputted completion | my inputted prompt) favors the shortest completion: every additional token multiplies in another factor in [0,1], so longer completions are penalized. An easy way around that is to average the log probability per token, i.e., use a length-normalized score, as is standard practice in perplexity calculations.
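To make the length-normalization fix concrete, here's a sketch of the per-token average score (the candidate labels and their per-token log probs below are made-up numbers, just to show how the raw sum and the average can disagree):

```python
from typing import Dict, List

def avg_logprob(token_logprobs: List[float]) -> float:
    """Length-normalized score: mean log Pr(token | prefix) over the completion's tokens."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log probs for two candidate completions.
candidates: Dict[str, List[float]] = {
    " positive": [-0.7],                                        # single-token label
    " Just another superhero movie": [-0.5, -0.2, -0.1, -0.2],  # multi-token label
}

# Raw sum favors the shorter completion (-0.7 > -1.0);
# the per-token average removes that length bias (-0.25 > -0.7).
best_raw = max(candidates, key=lambda c: sum(candidates[c]))
best_avg = max(candidates, key=lambda c: avg_logprob(candidates[c]))
print(best_raw, best_avg)
```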
Overall, I see completion/sampling as an unnecessary workaround for classification problems. I’d like to hear about the disadvantages of estimation.
> P.S. In your case of sentiment, you could also train it on the binary pair ‘negative’ and ‘positive’, or ‘0’ and ‘1’, and then let the log probs determine whether a text was really neutral (don’t train it on ‘neutral’).
To clarify, I’m not solving any specific classification problem. But to further the discussion on this interesting idea: have you or others run experiments with this method? Currently, many sentiment classifiers include neutral texts during training, e.g., Hugging Face’s sentiment tutorial. Here’s an old reference [1] for why that might be. And intuitively, I don’t see how Pr(neutral | text) could be calibrated or discriminative if the model never saw neutral text. Maybe dropping neutral examples trades off accuracy on the neutral class for greater accuracy on the others. It also introduces the inelegant follow-up problem of estimating cutoffs for the neutral class. There’s a broader discussion of this sort of task transformation method here.
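For concreteness, the cutoff problem I have in mind looks something like this (a rough sketch; `delta` is a made-up hyperparameter that would itself need labeled neutral data to tune, which is part of my concern):

```python
import math

def classify_with_neutral_band(logprob_negative: float,
                               logprob_positive: float,
                               delta: float = 0.15) -> str:
    """Map the two trained labels' log probs to {negative, neutral, positive} via a cutoff band."""
    p_neg, p_pos = math.exp(logprob_negative), math.exp(logprob_positive)
    p_pos = p_pos / (p_pos + p_neg)  # renormalize over the two trained labels
    if abs(p_pos - 0.5) <= delta:    # too close to call -> treat as neutral
        return "neutral"
    return "positive" if p_pos > 0.5 else "negative"
```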
- [1] Koppel, Moshe, and Jonathan Schler. “The importance of neutral examples for learning sentiment.” Computational Intelligence 22.2 (2006): 100–109.