Compute the probability of input text for classification

High-level problem

Say my input text gets tokenized into [t1, t2, t3, t4]. Can I use the API to estimate the log-probability of text as

log Pr(t1, t2, t3, t4) ?

If this estimation isn’t already implemented in the API, is there a way to estimate this type of conditional probability:

log Pr(t3, t4 | t1, t2) = log Pr(t3 | t1, t2) + log Pr(t4 | t1, t2, t3) ?

In this conditional probability, I’m the one providing the hypothesized completion text [t3, t4], not GPT-3. Setting the logprobs argument isn’t enough b/c (1) GPT-3 may not happen to sample t3 or t4, and (2) the response may exclude t3 or t4 if they aren’t among the top-logprobs tokens at their positions. That’s why the Completion endpoint doesn’t seem sufficient for my problem.
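
For concreteness, here’s a minimal sketch of the computation I’m after. token_logprob is a hypothetical function (no such API call exists, as far as I know) that would return GPT-3’s log Pr(next_token | context_tokens) for a token that I supply:

# Hypothetical: token_logprob(context_tokens, next_token) would return
# GPT-3's log Pr(next_token | context_tokens) for a token that *I* supply.
def conditional_logprob(context_tokens, completion_tokens, token_logprob):
    # Chain rule: log Pr(t3, t4 | t1, t2) = log Pr(t3 | t1, t2) + log Pr(t4 | t1, t2, t3)
    total = 0.0
    for i, token in enumerate(completion_tokens):
        total += token_logprob(context_tokens + completion_tokens[:i], token)
    return total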

Motivation

I’d like to evaluate a simple approach to language classification tasks, specifically where labels are textually meaningful. For example, in a sentiment classification task, an alternative to text completion is to estimate the probability of an inputted, completed text—

The sentiment of this tweet

"""
I loved the new Batman movie!
"""

is {sentiment}.

—where {sentiment} is replaced w/ positive, negative, or neutral, and then return the sentiment which gave the highest log-probability.
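
In code, the method is just an argmax over hand-written completions. A rough sketch, where sequence_logprob is a hypothetical scorer for a whole piece of text (again, not an existing endpoint):

template = ('The sentiment of this tweet\n\n'
            '"""\n'
            'I loved the new Batman movie!\n'
            '"""\n\n'
            'is {sentiment}.')

def classify(sequence_logprob, labels=('positive', 'negative', 'neutral')):
    # Score the full text once per candidate label and return the best-scoring label.
    scores = {label: sequence_logprob(template.format(sentiment=label)) for label in labels}
    return max(scores, key=scores.get)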

Has this method been evaluated anywhere? (IIUC, this approach is quite different from the deprecated Classification endpoint.)

I suppose OpenAI won’t release probabilities too liberally b/c a user could train on them and compete w/ GPT-3 or something. But I’m hoping there’s some way to provide the above approach to classification. It seems more straightforward than embeddings or completion.

Is {sentiment} the only thing in your completion?

If it is, a logprobs of 3 will give you all 3 values

I assume you want the probability score for the positive/negative/neutral result and not the text you are asking it to classify?

If it is not structured this way, can you restructure your training so the prompt is everything except the {sentiment} (with a space before it)?

Another approach is to look at the “classification” part of “embedding”. From memory it also gives log values.

Thank you for your response, @raymonddavey.

I assume you want the probability score for the positive/negative/neutral result and not the text you are asking it to classify?

Yes. Given inputted texts x (the text I want to classify, including some prompting language) and y (one of the labels that I’ve hand-crafted), I’d like GPT-3’s estimate of Pr(y | x). Though, since this is a classification problem, Pr(x, y) works too, and so does argmax over y of Pr(y | x), or equivalently argmax over y of Pr(x, y) (they give the same argmax b/c Pr(y | x) = Pr(x, y) / Pr(x), and Pr(x) doesn’t depend on y).

If it is, a logprobs of 3 will give you all 3 values

For simple prompts, y’s tokens may indeed be likely enough to be included among the top logprobs. But nothing explicitly prevents y’s tokens from falling outside the top logprobs (or never being sampled), and thus being completely missing from the Completion endpoint’s response.

Here’s an example of a harder classification problem:

import os
import openai

openai.api_key = os.getenv('OPENAI_API_KEY')

prompt = '''
A movie review can only belong to one of these categories: "Just another superhero movie" or "Generic hype".

Which category does this movie review belong to?
"""
A thrill ride for the ages! --Peter K. Rosenthal
"""
'''

response = openai.Completion.create(model='text-davinci-003',
                                    prompt=prompt,
                                    max_tokens=20,
                                    temperature=0)
print(response['choices'][0]['text'])
# prints: This movie review does not belong belong to either of the categories.

The correct label is 'Generic hype' of course. And while it’s nice to see GPT-3 conveying uncertainty, it might’ve been the case that Pr('Generic hype' | movie review, prompt) > Pr('Just another superhero movie' | movie review, prompt), even though they’re both low. So the method proposed in this question would result in a correct prediction, instead of the uncertain one that the completion method gave.

We can go down the prompt engineering rabbit hole to increase the chance that the completion endpoint either predicts a class in the label set, or includes y in its logprobs. But that’s neither simple nor completely effective. Estimating Pr(y | x) is both of these things.

If it is not structured this way, can you restructure your training so the prompt is everything except the {sentiment} (with a space before it)?

To clarify, there’s no training necessary in this method. Though I know what you mean, and indeed that’s the standard completion approach to solve classification problems. But the method described in my question is much simpler. There’s no sampling; it just computes what GPT-3 has already modeled.

Another approach is to look at the “classification” part of “embedding”. From memory it also gives log values.

I assume you’re referring to training a classifier using embeddings as features. That approach indeed works. But I’d like to avoid going through embeddings just to loosely estimate something which is ideally immediately available for an autoregressive language model.

I know that you are not doing fine-tuning, but I guess this is why all the examples that OpenAI give for classifiers suggest using a single token in the completion. (point 2 on the link)

“Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.”

They imply that you do this so you can use max_tokens = 1, but it is also required for the logprobs approach to work.

For classification, I fine tuned babbage to categorize incoming text into ‘0’ or ‘1’. I forced it to have a single token as the output and a temperature of 0. I used the token_logprobs to see how close it really was to either ‘0’ or ‘1’. This will also generalize to more than two categories.
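
Roughly, that setup looks like this with the Completion endpoint (a minimal sketch; the model name and prompt are placeholders for whatever your fine tune actually uses):

import os
import openai

openai.api_key = os.getenv('OPENAI_API_KEY')

response = openai.Completion.create(
    model='my-fine-tuned-babbage',           # placeholder: your fine-tuned model
    prompt='some incoming text\n\n###\n\n',  # placeholder: text plus the separator used in training
    max_tokens=1,    # the label is a single token ('0' or '1')
    temperature=0,
    logprobs=2,      # also return the top 2 token log-probs at that position
)

choice = response['choices'][0]
top_logprobs = choice['logprobs']['top_logprobs'][0]  # dict: token -> log-probability
print(choice['text'], top_logprobs)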

There is no need for embeddings in classification problems where the number of classes is small. If you have many classes, then you might want to use embeddings instead. In your case of three classes, I would just do a fine tune.

P.S. In your case of sentiment, you could also train it on the binary pair ‘negative’ and ‘positive’, like my ‘0’ and ‘1’, and then let the log prob determine whether it was really neutral (don’t train it on ‘neutral’). Having fewer classes cuts down on your noise (improves your SNR), and lets the engine determine neutral numerically. In this sense, your answer is really on the interval [-1, 1]. Let the log prob convert it to this interval and declare your own version of ‘neutral’.

Thank you both for the advice on solving classification problems using currently available endpoints: finetune w/ single-token labels and then complete w/ logprobs=# classes and max_tokens=1.

I’d like to pivot to discussing the theoretical merits of directly estimating Pr(my inputted completion | my inputted prompt) vs. what’s currently available: sample a completion given my inputted prompt and provide Pr(GPT-3's outputted completion | my inputted prompt). Here’s a short comparison b/t estimation and completion:

  • Both can be zero shot or finetuned using the exact same data and loss.
  • Estimation directly yields the probabilities needed for Bayes-optimal classification; completion does not. Transforming each label to a single token should almost guarantee it, though this seems more effective when finetuning is feasible.
  • Estimation does not require transforming labels to single tokens. This advantage could be significant b/c it allows GPT-3 to exploit the label’s semantics. In the movie review example above, the label 'Just another superhero movie' is richer than any single token. Maybe this turns out to be a drawback in practice though; it’s hard to say.

One obvious problem is that argmax Pr(my inputted completion | my inputted prompt) favors the shortest completion, because each additional token multiplies in another probability in [0, 1]. An easy way around that is to take the average log-probability per token, which is standard practice in perplexity calculations.
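
For example, with toy numbers (made up, not real API output):

def mean_logprob(token_logprobs):
    # Average per-token log-probability so completions of different lengths are comparable.
    return sum(token_logprobs) / len(token_logprobs)

short_label = [-0.9]                    # hypothetical per-token log-probs of a 1-token label
long_label = [-0.5, -0.4, -0.6, -0.3]   # hypothetical per-token log-probs of a 4-token label

print(sum(short_label), sum(long_label))                    # -0.9 vs -1.8: raw sums favor the short label
print(mean_logprob(short_label), mean_logprob(long_label))  # -0.9 vs -0.45: per-token means remove the length bias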

Overall, I see completion/sampling as an unnecessary workaround to solve classification problems. I’d like to hear about the disadvantages of estimation.

P.S. In your case of sentiment, you could also train it on binary pairs ‘negative’ and ‘positive’, my ‘0’ or ‘1’, and then let the log prob determine if it was really neutral (don’t train it on ‘neutral’).

To clarify, I’m not solving any specific classification problem. Though to further discussion on this interesting idea: have you or others run experiments w/ this method? Currently, many sentiment classifiers include neutral texts during training, e.g., huggingface’s sentiment tutorial. Here’s an old reference1 for why that might be. And intuitively, I don’t see how Pr(neutral | text) could be calibrated or discriminative if the model never saw neutral text. Maybe dropping neutral examples trades off accuracy on the neutral class for greater accuracy on the others. It also introduces the inelegant follow-up problem of estimating cutoffs for the neutral class. There’s a broader discussion on these sorts of task transformation methods here.

  1. Koppel, Moshe, and Jonathan Schler. “The importance of neutral examples for learning sentiment.” Computational intelligence 22.2 (2006): 100-109.

To estimate the performance, I would use labeled examples or ‘truth data’ that wasn’t used in training. But with GPT-3, who knows what they used, so you really can’t do this purely, but you can get close to this when looking at your own fine tunes. It pretty much has to be empirically measured, IMO.

As for zero shot, it gives you the most flexibility, but realize there is a hit to cost, since your prompts will be larger. But running a fine tuned model also costs more in general per token (for the smaller prompt), so you should do a cost trade study and see which one makes more cost sense vs. performance and flexibility.

Also, another hyperparameter to worry about, and one that EXPLODES your trade space with zero-shot, is the exact language you use in the prompt. You can get drastically different answers if one word is changed in the prompt. But with experimentation, you could make an informed decision about which prompt wording works better … and another annoying thing is that this can vary across versions of the model you are using, even within the same model class such as ‘davinci’. Also along these lines, you have cost differences between ada, babbage, curie, and davinci … but they have performance variations too. Depending on the task and amount of training data, you can do a great job performance-wise with the lower models at the lower cost.

But the good thing is all of the transfer learning that occurs during a fine tune or a zero-shot. In the case of fine tuning, your training data set can be a lot smaller and still get good performance, vs. training your own model from scratch … this is the big advantage here. Good training data is hard to come by, so why not leverage GPT-3 and create less of it to get an acceptable answer.

As for the training of ‘neutral’ … if you care or rely on neutral as an instantaneous result, then yeah it makes sense to train it. In my case I only cared about ‘negative’ and ‘positive’ and let this map to a set of integers between +/-10, and further averaging it over time. Over time a ‘neutral’ would emerge, but I only cared about the extremes over time.

I appreciate the general advice around optimizing usage of the finetuning and completion endpoints. But I’d like to re-center the discussion to comparing estimation vs completion at a theoretical level. My goal now is to understand the technical reasons for not making estimation or argmax classification endpoints available.

As for zero shot, it gives you the most flexibility, but realize there is a hit to cost, since your prompts will be larger. But running a fine tuned model also costs more in general per token (for the smaller prompt), so you should do a cost trade study and see which one makes more cost sense vs. performance and flexibility.

Also another hyper parameter to worry about, and one that EXPLODES your trade space with zero-shot, is the exact language you use in the prompt.

another annoying thing is that this can vary over the version of the model you are using

The prompt is given in both estimation and completion, and both can be finetuned (or not) using the exact same data. So don’t these problems apply equally?

So far, I’m not yet seeing a relative downside to estimation.

Maybe an example would help me understand what you are looking for.

The main downside to fine tuning is cost. But the benefit, from my perspective, is that you are training the model’s output format. For example, I can see in the logs that in my fine tuned example of {Input Text} → {Select ‘0’ or ‘1’ based on the fine tune training data}, the model will come up with a variety of answers, including ‘0’ (the most probable) and ‘zero’ (the string ‘zero’, which does not adhere to the format I want, which is ‘0’). But because I fine tuned it, it always selects ‘0’, even though it thinks ‘zero’ is another option.

So fine tuning and argmax give me a specific format, which is good. If I were to do this zero-shot, my concern would be that the output would drift between ‘0’ and ‘zero’ or whatever else, because it hasn’t been trained to only respond with ‘0’ or ‘1’. So basically, to avoid drift in the output, I am using argmax fine-tuned models. But maybe your prompt would prevent drift, or you could map all the drifting aliases back to the correct label (you’d have to experiment with this).
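
The alias mapping could be as simple as this sketch (the table entries are hypothetical; you’d fill in whatever drift you actually observe):

ALIASES = {'zero': '0', 'one': '1'}  # observed drifting outputs -> canonical labels

def canonicalize(output_text):
    text = output_text.strip().lower()
    return ALIASES.get(text, text)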

However, if I really want drift, and I need a lot of context that varies dynamically based on the input, then use zero-shot (with embeddings), similar to the “Truthful Chatbot” example: Question answering using embeddings-based search | OpenAI Cookbook

Also, as a side note: using a zero-shot classifier on things GPT-3 understands (like sentiment) is viable, and it probably does pretty well out of the box (minus any format issues in the output). But in my case, I am literally sending text that has no describable rule and forcing GPT-3 to learn the rules and save them in the fine tuned coefficients.

So if you can’t describe the rule to GPT-3, you are pretty much forced to train it by fine tuning.

To clarify, I’m not asking about finetuning vs zero shot. Let me know if this reply above was unclear. I’m asking about two different approaches to classification using a language model:

  1. completion: sample from the distribution Pr( · | inputted prompt) and return that sampled output as the predicted class.
  2. estimation: for each inputted completion (a new required input) in the label set, compute Pr(inputted completion | inputted prompt). Then return the inputted completion with the highest probability, perhaps after making an adjustment for the number of tokens in the inputted completion (as done in perplexity calculations).

Finetuning GPT-3 using the exact same dataset and loss helps both approaches. And both approaches can be used in a zero shot manner. The difference is only in how classification is performed. What do you think of the estimation approach? I discussed advantages in the linked reply above.

I guess I am a bit confused here. The core model is using a neural network. I don’t see any distributions in the model. From a high level, it is implementing this paper … am I wrong?

GPT-3 is an autoregressive language model. It models the probability of a token given previous tokens using a transformer neural network. See equation 1 in the first GPT paper1. In a completion endpoint response, token_logprobs is GPT-3’s estimated log Pr(token n+1 | tokens 1:n). And completion works by sampling from this probability distribution. That’s why completion is non-deterministic.
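
Concretely, the autoregressive factorization is just the chain rule of probability:

\log \Pr(t_1, \dots, t_n) = \sum_{i=1}^{n} \log \Pr(t_i \mid t_1, \dots, t_{i-1})

Each term on the right-hand side is exactly the kind of quantity that token_logprobs reports, one position at a time.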

From a high level, it is implementing this paper … am I wrong?

Yes, the paper you’ve linked is about the generic transformer architecture for neural networks. This architecture can be applied to any type of task. It’s present in GPT, BERT, CLIP (images), and more.

  1. Radford, Alec, et al. “Improving language understanding by generative pre-training.” (2018).

OK. I see. Maybe one of the OpenAI engineers can chime in.

Ok sounds good. Thank you for sharing your experiences and general advice on classification w/ GPT-3 :slightly_smiling_face:

Quick follow-up: I asked a more precise version of the question in a different forum here.

It looks like you are digging deep into the algorithms and want validation of your ideas before implementing them.

One question I have is, why not just implement it both ways, and see which way performs better?

Yup, would love to do that, but no endpoint currently lets one compute Pr(input token | other inputted tokens) :frowning:

logprobs in the completion endpoint only gives Pr(output token | inputted tokens)

I was actually thinking of doing your own controlled experiment.

You code it both ways, and so you are in control.

In my experience, this is really the only way to see what’s what, especially in your case where you have a hypothesis (theory) and therefore need to test it. Basically the scientific method!

Now if you can’t code out a small controlled experiment, then well, that’s another thing. Then try to isolate the problem even more and solve that.

Totally agree. But the GPT-3 model weights are not public, so there’s just no way to compute what’s needed to run the experiment.