Compute the probability of input text for classification

Thank you both for the advice on solving classification problems using currently available endpoints: finetune w/ single-token labels and then complete w/ logprobs=# classes and max_tokens=1.
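For concreteness, that recipe looks roughly like this; a hedged sketch assuming the legacy `openai` Python client, with the model name and single-token labels as placeholders:

```python
# Sketch of classification via completion: a model finetuned so that each
# class is a single token, then a 1-token completion with logprobs returned.
# Assumes openai.api_key is already set; "YOUR_FINETUNED_MODEL" is a placeholder.
import openai

def classify(prompt, label_tokens=(" 0", " 1"), model="YOUR_FINETUNED_MODEL"):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=1,                 # the label is a single token
        temperature=0,                # deterministic: take the most likely token
        logprobs=len(label_tokens),   # return log-probs for the top tokens
    )
    top = response["choices"][0]["logprobs"]["top_logprobs"][0]  # {token: logprob}
    scores = {t: top.get(t, float("-inf")) for t in label_tokens}
    return max(scores, key=scores.get), scores
```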

I’d like to pivot to discussing the theoretical merits of directly estimating Pr(my inputted completion | my inputted prompt) vs. what’s currently available: sample a completion given my inputted prompt and provide Pr(GPT-3's outputted completion | my inputted prompt). Here’s a short comparison b/t estimation and completion:

  • Both can be zero shot or finetuned using the exact same data and loss.
  • Estimation guarantees estimation of the probabilities necessary for Bayes optimal classification. Completion does not. Transforming each label to a single token should almost guarantee it, though this seems more effective when finetuning is feasible.
  • Estimation does not require transforming labels to single tokens. This advantage could be significant b/c it allows GPT-3 to exploit the label’s semantics. In the movie review example above, the label 'Just another superhero movie' is richer than any single token. Maybe this turns out to be a drawback in practice though; it’s hard to say.

One obvious problem is that argmax Pr(my inputted completion | my inputted prompt) returns the shortest completion, because probabilities are in [0,1]. An easy way to get around that is to take the average per token, which is standard practice in perplexity calculations.
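For example, a minimal sketch of that per-token averaging (the names are just illustrative):

```python
# token_logprobs holds log Pr(token_i | prompt, earlier completion tokens)
# for one candidate completion; averaging removes the bias toward short completions.
def avg_logprob(token_logprobs):
    return sum(token_logprobs) / len(token_logprobs)

def pick_label(candidates):
    # candidates maps each candidate completion to its list of per-token log-probs
    return max(candidates, key=lambda c: avg_logprob(candidates[c]))
```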

Overall, I see completion/sampling as an unnecessary workaround to solve classification problems. I’d like to hear about the disadvantages of estimation.

P.S. In your case of sentiment, you could also train it on the binary pair 'negative' and 'positive', or my '0' or '1', and then let the log prob determine if it was really neutral (don't train it on 'neutral').

To clarify, I’m not solving any specific classification problem. Though, to further the discussion on this interesting idea: have you or others run experiments w/ this method? Currently, many sentiment classifiers include neutral texts during training, e.g., huggingface’s sentiment tutorial. Here’s an old reference1 for why that might be. And intuitively, I don’t see how Pr(neutral | text) could be calibrated or discriminative if the model never saw neutral text. Maybe dropping neutral examples trades off accuracy on the neutral class for greater accuracy on the others. It also introduces the inelegant follow-up problem of estimating cutoffs for the neutral class. There’s a broader discussion on this sort of task transformation method here.

  1. Koppel, Moshe, and Jonathan Schler. “The importance of neutral examples for learning sentiment.” Computational intelligence 22.2 (2006): 100-109.

To estimate the performance, I would use labeled examples or ‘truth data’ that wasn’t used in training. But with GPT-3, who knows what was used to train it, so you can’t really do this purely; you can get close, though, when evaluating your own fine-tunes. It pretty much has to be measured empirically, IMO.

As for zero shot, it gives you the most flexibility, but realize there is a hit to cost, since your prompts will be larger. But running a fine-tuned model also costs more per token in general (for the smaller prompt), so you should do a cost trade study and see which one makes more sense in terms of cost vs. performance and flexibility.

Another hyperparameter to worry about, and one that EXPLODES your trade space with zero-shot, is the exact language you use in the prompt. You can get drastically different answers if one word in the prompt is changed. But with experimentation, you can make an informed decision about which prompt wording works better … and another annoying thing is that this can vary across versions of the model you are using, even within the same model class such as ‘davinci’. Along these lines, you also have cost differences between ada, babbage, curie, and davinci … but they have performance variations too. Depending on the task and the amount of training data, you can do a great job performance-wise with the lower models at the lower cost.

But the good thing is all of the transfer learning that occurs with a fine-tune or a zero-shot prompt. In the case of fine-tuning, your training data set can be a lot smaller and still get good performance vs. training your own model from scratch … this is the big advantage here. Good training data is hard to come by, so why not leverage GPT-3 and create less of it to get an acceptable answer.

As for the training of ‘neutral’ … if you care about or rely on neutral as an instantaneous result, then yeah, it makes sense to train it. In my case I only cared about ‘negative’ and ‘positive’, mapped these to a set of integers between +/-10, and further averaged them over time. Over time a ‘neutral’ would emerge, but I only cared about the extremes.


I appreciate the general advice around optimizing usage of the finetuning and completion endpoints. But I’d like to re-center the discussion to comparing estimation vs completion at a theoretical level. My goal now is to understand the technical reasons for not making estimation or argmax classification endpoints available.

As for zero shot, it gives you the most flexibility, but realize there is a hit to cost, since your prompts will be larger. But running a fine-tuned model also costs more per token in general (for the smaller prompt), so you should do a cost trade study and see which one makes more sense in terms of cost vs. performance and flexibility.

Another hyperparameter to worry about, and one that EXPLODES your trade space with zero-shot, is the exact language you use in the prompt.

another annoying thing is that this can vary across versions of the model you are using

The prompt is given in both estimation and completion, and both can be finetuned (or not) using the exact same data. So don’t these problems apply equally?

So far, I’m not yet seeing a relative downside to estimation.

Maybe an example would help me understand what you are looking for.

The main downside to fine-tuning is cost. But the benefit, from my perspective, is that you are training the model’s output format. For example, I can see in the logs that for my fine-tuned example of {Input Text} → {Select ‘0’ or ‘1’ based on the fine-tune training data}, the model will come up with a variety of answers, including ‘0’ (the most probable) and ‘zero’ (the string ‘zero’, which does not adhere to what I want, which is ‘0’). But because I fine-tuned it, it always selects ‘0’, even though it thinks ‘zero’ is another option.

So fine-tuning plus argmax gives me a specific format, which is good. If I were to do this zero-shot, my concern would be that the output would drift between ‘0’ and ‘zero’ or whatever else, because the model hasn’t been trained to respond only with ‘0’ or ‘1’. So basically, to avoid drift in the output, I am using argmax fine-tuned models. But maybe your prompt would prevent drift, or you could map all the drifting aliases back to the correct label (you’d have to experiment with this).
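If you go the alias-mapping route, it could be as simple as something like this (the aliases below are made up; you’d collect the real ones from your logs):

```python
# Hypothetical map folding drifting zero-shot outputs back onto canonical labels.
ALIASES = {
    "0": "0", "zero": "0", "negative": "0",
    "1": "1", "one": "1", "positive": "1",
}

def canonicalize(raw_output):
    # returns None for anything unrecognized, which you'd handle separately
    return ALIASES.get(raw_output.strip().lower())
```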

However, if I actually do want drift, and I need a lot of context that varies dynamically with the input, then I use zero-shot (with embeddings), similar to the “Truthful Chatbot” example: openai-cookbook/Question_answering_using_embeddings.ipynb at main · openai/openai-cookbook · GitHub


Also, as a side note: using a zero-shot classifier on things GPT-3 understands (like sentiment) is viable, and it probably does pretty well out of the box (minus any format issues in the output). But in my case, I am literally sending text that has no describable rule and forcing GPT-3 to learn the rules and save them in the fine-tuned coefficients.

So if you can’t describe the rule to GPT-3, you are pretty much forced to train it by fine tuning.


To clarify, I’m not asking about finetuning vs zero shot. Let me know if this reply above was unclear. I’m asking about two different approaches to classification using a language model:

  1. completion: sample from the distribution Pr( · | inputted prompt) and return that sampled output as the predicted class.
  2. estimation: for each inputted completion (a new required input) in the label set, compute Pr(inputted completion | inputted prompt). Then return the inputted completion with the highest probability, perhaps after making an adjustment for the number of tokens in the inputted completion (as done in perplexity calculations).

Finetuning GPT-3 using the exact same dataset and loss helps both approaches. And both approaches can be used in a zero shot manner. The difference is only in how classification is performed. What do you think of the estimation approach? I discussed advantages in the linked reply above.
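To make the estimation approach concrete, here is a minimal sketch using an open-source model (GPT-2 via Hugging Face transformers), since GPT-3’s weights aren’t public; the prompt/label formatting and the per-token averaging are my assumptions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_logprob(prompt, completion):
    """Average per-token log Pr(completion | prompt)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    completion_ids = tokenizer(completion, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                    # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # predicts tokens 2..n
    targets = input_ids[0, 1:]
    token_logprobs = log_probs.gather(1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the completion's tokens, then average so length doesn't dominate
    return token_logprobs[prompt_ids.shape[1] - 1:].mean().item()

def classify(prompt, labels):
    # labels usually need a leading space so GPT-2's BPE tokenizes them naturally,
    # e.g. classify("Review: An instant classic.\nSentiment:", [" positive", " negative"])
    return max(labels, key=lambda label: avg_logprob(prompt, label))
```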

I guess I am a bit confused here. The core model is using a neural network. I don’t see any distributions in the model. From a high level, it is implementing this paper … am I wrong?

GPT-3 is an autoregressive language model. It models the probability of a token given previous tokens using a transformer neural network. See equation 1 in the first GPT paper1. In a completion endpoint response, token_logprobs is GPT-3’s estimated log Pr(token n+1 | tokens 1:n). And completion works by sampling from this probability distribution. That’s why completion is non-deterministic.
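To connect that back to estimation: Pr(my inputted completion | my inputted prompt) factors by the chain rule into exactly these per-token probabilities, i.e. Pr(completion | prompt) = Pr(completion token 1 | prompt) × Pr(completion token 2 | prompt, completion token 1) × …, so in principle it is computable from the same per-token log probabilities the model already produces; the endpoints just don’t expose them for inputted completions.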

From a high level, it is implementing this paper … am I wrong?

Yes, the paper you’ve linked is about the generic transformer architecture for neural networks. This architecture can be applied to any type of task. It’s present in GPT, BERT, CLIP (images), and more.

  1. Radford, Alec, et al. “Improving language understanding by generative pre-training.” (2018).

OK. I see. Maybe one of the OpenAI engineers can chime in.

Ok sounds good. Thank you for sharing your experiences and general advice on classification w/ GPT-3 :slightly_smiling_face:


Quick follow-up: I asked a more precise version of the question in a different forum here.


It looks like you are digging deep into the algorithms and want validation of your ideas before implementing them.

One question I have is, why not just implement it both ways, and see which way performs better?

Yup, would love to do that, but no endpoint currently lets one compute Pr(input token | other inputted tokens) :frowning:

logprobs in the completion endpoint only gives Pr(output token | inputted tokens)

I was actually thinking of doing your own controlled experiment.

You code it both ways, and so you are in control.

In my experience, this is really the only way to see what’s what, especially in your case where you have a hypothesis (theory) and therefore need to test it. Basically the scientific method!

Now if you can’t code out a small controlled experiment, then well, that’s another thing. Then try to isolate the problem even more and solve that.

Totally agree. But the GPT-3 model weights are not public. There’s just no way to compute what’s needed to run the experiment.

Right, so create a small version of your own, with your own weights. And run it both ways to get insight. Don’t use GPT-3.


I ran zero-shot sampling/completion w/ GPT-3 curie (the second largest GPT-3) and got 3% accuracy on a very difficult classification task. I then ran the proposed method zero-shot w/ an open-source GPT-2 (technically, a GPT-2 which is half the size of the main GPT-2), and got 14% accuracy on the same task.

The experiment isn’t controlled b/c the models are different, but GPT-3 curie is purportedly much more capable than GPT-2. So this result makes it look like the proposed method is much better. But I’m certain that there’s no good reason to extrapolate the result all the way to GPT-3 text-davinci-003, which is additionally trained w/ humans in the loop. Sampling from text-davinci-003 is 60% accurate. So my only real question is how well the proposed method works on text-davinci-003 and davinci. Maybe it’s still 60% accurate, maybe it’s 65% accurate, who knows.

There are definitely a lot of variables floating around here.

When I hear “Zero Shot” and “Difficult Classification Task”, I immediately think of training one of the base GPT-3 models to immensely improve the classification before going too much further. Do you think a good fine-tune on davinci would get it from 60% to 90%? Can you test this somehow? I know the models are black boxes, but you can still evaluate the correctness in the limited output data.

Also, I am a bit confused; it seems like you want to alter the internals to get a better answer. How are you evaluating your new alternative on davinci without having access to the internals?

How are you evaluating your new alternative on davinci without having access to the internals?

I only evaluated the proposed method on GPT-2 b/c it’s open-source. Next, I’d like to evaluate the alternative on davinci and text-davinci-003.

Do you think a good fine-tune on davinci would get it from 60% to 90%? Can you test this somehow?

I’ll finetune davinci eventually, and I think it will significantly help both methods. But I’d prioritize comparing the proposed method vs sampling/completion in the zero-shot regime b/c:

  1. It’s way less work and money.
  2. Zero-shot classification is a big and bold benefit of large language models. If the proposed method consistently outperforms sampling on zero-shot classification, that’s an important result.
  3. While both methods should be assessed after finetuning (in addition to before), the impact of finetuning is not necessarily relevant to the question: what’s the best way to frame classification problems when we have a big, capable LM like GPT-3? Is it better to autoregressively sample from it, or to just do Bayes-optimal classification? Maybe the answer depends on whether training data is available, but right now I don’t see why that’d be.

For classification, what I do is have one output token and select a temperature of 0. For a highly fine-tuned base model, I assume this is close to Bayes-optimal classification, at least in terms of what the network understands. And you can get good results with the lower-end ada and babbage.

With the higher-end curie and davinci, you can do the same, but it is my belief that they can achieve the same performance as the lower models with less data.

As for autoregressive sampling (or extrapolation), I would be wary of it for classification, but I’m probably not seeing exactly why extrapolating is a useful classification technique, so feel free to enlighten me.
