# Compute the probability of input text for classification

To clarify, I’m not asking about finetuning vs zero shot. Let me know if this reply above was unclear. I’m asking about two different approaches to classification using a language model:

1. completion: sample from the distribution Pr( · | `inputted prompt`) and return that sampled output as the predicted class.
2. estimation: for each `inputted completion` (a new required input) in the label set, compute Pr(`inputted completion` | `inputted prompt`). Then return the `inputted completion` with the highest probability, perhaps after making an adjustment for the number of tokens in the `inputted completion` (as done in perplexity calculations).
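To make approach 2 concrete, here’s a sketch. It isn’t runnable against any current endpoint; `completion_logprob` is a hypothetical stand-in for a function that returns log Pr(`inputted completion` | `inputted prompt`), and the prompt, labels, and probabilities below are all made up:

```python
import math

def classify_by_estimation(prompt: str,
                           labels: list[str],
                           completion_logprob) -> str:
    """Return the label with the highest log Pr(label | prompt).

    `completion_logprob` is a hypothetical callable standing in for
    the language model's log Pr(completion | prompt).
    """
    scores = {label: completion_logprob(prompt, label) for label in labels}
    return max(scores, key=scores.get)

# toy stand-in for a real language model, for illustration only
fake_logprobs = {'positive': math.log(0.7), 'negative': math.log(0.2)}
classify_by_estimation('This movie was great. Sentiment:',
                       ['positive', 'negative'],
                       lambda prompt, label: fake_logprobs[label])
# → 'positive'
```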

Finetuning GPT-3 using the exact same dataset and loss helps both approaches. And both approaches can be used in a zero shot manner. The difference is only in how classification is performed. What do you think of the estimation approach? I discussed advantages in the linked reply above.

I guess I am a bit confused here. The core model is using a neural network. I don’t see any distributions in the model. From a high level, it is implementing this paper … am I wrong?

GPT-3 is an autoregressive language model. It models the probability of a token given previous tokens using a transformer neural network. See equation 1 in the first GPT paper [1]. In a completion endpoint response, `token_logprobs` is GPT-3’s estimated log Pr(`token n+1` | `tokens 1:n`). And completion works by sampling from this probability distribution. That’s why completion is non-deterministic.
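To illustrate the sampling step with a made-up three-token vocabulary and made-up probabilities (nothing like GPT-3’s actual distribution):

```python
import math
import random

# a made-up next-token distribution -- not GPT-3's actual vocabulary or probabilities
next_token_logprobs = {' are': math.log(0.6), ' is': math.log(0.3), ' was': math.log(0.1)}

tokens = list(next_token_logprobs)
probs = [math.exp(lp) for lp in next_token_logprobs.values()]

# completion samples from Pr( · | tokens 1:n), so repeated calls can differ
samples = random.choices(tokens, weights=probs, k=5)
```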

From a high level, it is implementing this paper … am I wrong?

Yes, the paper you’ve linked is about the generic transformer architecture for neural networks. This architecture can be applied to any type of task. It’s present in GPT, BERT, CLIP (images), and more.

[1] Radford, Alec, et al. “Improving language understanding by generative pre-training.” (2018).

OK. I see. Maybe one of the OpenAI engineers can chime in.

Ok sounds good. Thank you for sharing your experiences and general advice on classification w/ GPT-3


Quick follow-up: I asked a more precise version of the question in a different forum here.


It looks like you are digging deep into the algorithms and want validation on your ideas before implementing them.

One question I have is, why not just implement it both ways, and see which way performs better?

Yup would love to do that, but no endpoint currently lets one compute Pr(`input token` | `other inputted tokens`)

`logprobs` in the completion endpoint only gives Pr(`output token` | `inputted tokens`)

I was actually thinking of you doing your own controlled experiment.

You code it both ways, and so you are in control.

In my experience, this is really the only way to see what’s what, especially in your case where you have a hypothesis (theory) and therefore need to test it. Basically the scientific method!

Now if you can’t code out a small controlled experiment, then well, that’s another thing. Then try to isolate the problem even more and solve that.

Totally agree. But the GPT-3 model weights are not public. There’s just no way to compute what’s needed to run the experiment

Right, so create a small version of your own, with your own weights. And run it both ways to get insight. Don’t use GPT-3.


I ran zero-shot sampling/completion w/ GPT-3 curie (the second largest GPT-3) and got 3% accuracy on a very difficult classification task. I then ran the proposed method zero-shot w/ an open-source GPT-2 (technically, a GPT-2 which is half the size of the main GPT-2), and got 14% accuracy on the same task.

The experiment isn’t controlled b/c the models are different, but GPT-3 curie is purportedly much more capable than GPT-2. So this result makes it look like the proposed method is much better. But I’m certain that there’s no good reason to extrapolate the result all the way to GPT-3 text-davinci-003, which is additionally trained w/ humans in the loop. Sampling from text-davinci-003 is 60% accurate. So my only real question is how well the proposed method works on text-davinci-003 and davinci. Maybe it’s still 60% accurate, maybe it’s 65% accurate, who knows.

There are definitely a lot of variables floating around here.

When I hear “Zero Shot” and “Difficult Classification Task”, I immediately think of training one of the base GPT-3 models to immensely improve the classification before going too much further. Do you think a good fine-tune on davinci would get it from 60% to 90%? Can you test this somehow? I know the models are black boxes, but you can still evaluate the correctness in the limited output data.

Also I am a bit confused, it seems like you are wanting to alter the internals to get a better answer. How are you evaluating your new alternative on davinci without having access to the internals?

I only evaluated the proposed method on GPT-2 b/c it’s open-source. Next, I’d like to evaluate the alternative on davinci and text-davinci-003.

Do you think a good fine-tune on davinci would get it from 60% to 90%? Can you test this somehow?

I’ll finetune davinci eventually, and I think it will significantly help both methods. But I’d prioritize comparing the proposed method vs sampling/completion in the zero-shot regime b/c:

1. It’s way less work and money.
2. Zero-shot classification is a big and bold benefit of large language models. If the proposed method consistently outperforms sampling on zero-shot classification, that’s an important result.
3. While both methods should be assessed after finetuning (in addition to before), the impact of finetuning is not necessarily relevant to the question: what’s the best way to frame classification problems when we have a big, capable LM like GPT-3? Is it better to autoregressively sample from it, or to just do Bayes-optimal classification? Maybe the answer depends on whether training data is available, but right now I don’t see why that’d be.

For classification, what I do is have one output token and select a temperature of 0. For a highly fine-tuned base model, I assume this is close to Bayes-optimal classification, at least in terms of what the network understands. And you can get good results with the lower-end ada and babbage.

With the higher-end curie and davinci, you can do the same, but it is my belief that they can achieve the same performance as the lower models with less data.

As for autoregressive sampling (or extrapolation), I would be wary of it for classification, but I’m probably not seeing exactly why extrapolating is a useful classification technique, so feel free to enlighten me.


Yup, to reiterate your points: the closest thing to Bayes-optimal classification using the completion endpoint is to:

1. Transform or point to each class using a single token
2. Set `max_tokens=1`
3. Set `temperature=0`
4. Set `logit_bias` = {class token id: log Pr(class)}, where Pr(class) is estimated from training data (or guessed!)
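Step 4 might look like this, with made-up token ids and a guessed prior (as I understand it, the API accepts `logit_bias` values between -100 and 100, which log-priors comfortably fit within):

```python
import math

# hypothetical single-token ids for two classes (made-up ids), and class
# priors estimated from training data (or guessed)
class_token_ids = {'positive': 3967, 'negative': 4633}
class_priors    = {'positive': 0.8,  'negative': 0.2}

# bias each class token by its log-prior; this dict would be passed to the
# completion endpoint alongside max_tokens=1 and temperature=0
logit_bias = {str(class_token_ids[c]): math.log(p)
              for c, p in class_priors.items()}
```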

The problems with transforming a class to a single token are that:

1. The transformation is not always a trivial prompt engineering task when the classes are meaningful phrases, or when there are a lot of classes.
2. Even if it is trivial, the completion still is not guaranteed to be one of the single tokens used to represent classes. This forces the user to study degenerate completions and then implement ways to post-process them.
3. If the transformation doesn’t include the class’ original name, then useful semantics in the class name would be unexploited by GPT-3.

I just see sampling as an unnecessary workaround. There’s a potentially simpler approach which should be evaluated.

1. The transformation is not always a trivial prompt engineering task when the classes are meaningful phrases, or when there are a lot of classes.

I would avoid lots of classes coming out of one classifier, mainly because I want to maximize SNR. If you need lots of classes, create more classifiers and have each classifier handle a smaller set of classes.

Now when the classes are meaningful phrases? I would avoid that too, maybe I’m not seeing the benefit of this. You could always map the single token class back to meaningful phrases through a lookup, either straight lookup of some sort, or a correlated lookup like an embedding.
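The straight-lookup version is just a dict from the single-token class back to its meaningful phrase (these labels are made up for illustration):

```python
# single output token -> meaningful phrase, via a straight lookup
token_to_phrase = {
    '0': 'billing question',
    '1': 'technical support request',
    '2': 'feature request',
}

def resolve_class(token: str) -> str:
    # strip whitespace so completions like ' 1' still resolve
    return token_to_phrase[token.strip()]

resolve_class(' 1')
# → 'technical support request'
```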

1. Even if it is trivial, the completion still is not guaranteed to be one of the single tokens used to represent classes. This forces the user to study degenerate completions and then implement ways to post-process them.

You can use the `token_logprobs` to at least see how close the classification was to your token, and you can backoff on any action if it’s not close enough.

As for degenerate completions, you will always have to code the corner cases coming out of these. A simple example: if your classifier expects ‘0’ or ‘1’ in the output, the fine-tuned GPT-3 model can output ’ 0’, ’ zero’, etc., so you alias these back to ‘0’. You can even seed it with entity extraction values from the original input (see below for running multiple models in parallel).
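A sketch of that aliasing for a ‘0’/‘1’ classifier (the alias table is illustrative, not exhaustive):

```python
# map degenerate completions back to the expected class tokens
ALIASES = {
    '0': '0', ' 0': '0', 'zero': '0', ' zero': '0',
    '1': '1', ' 1': '1', 'one': '1', ' one': '1',
}

def normalize_completion(raw: str) -> str:
    """Alias a raw completion back to '0' or '1', or flag it as degenerate."""
    key = raw.lower()
    if key in ALIASES:
        return ALIASES[key]
    # fall back to a whitespace-stripped lookup before giving up
    return ALIASES.get(key.strip(), 'degenerate')

[normalize_completion(s) for s in ['0', ' 0', ' Zero', 'huh?']]
# → ['0', '0', '0', 'degenerate']
```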

In the case of bad classifications, then this is where multiple models come in. You run a variety of diverse models on the same input, and you make a decision based on the entirety of the output. These models can even be non-AI based, such as RegEx correlators. You just need an algorithm on the back end to fuse this information into a final result.
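A minimal sketch of that fusion, with a RegEx correlator standing in for one of the non-AI models (the labels, pattern, and the other models’ votes are all made up):

```python
import re
from collections import Counter

def regex_classifier(text: str) -> str:
    # a non-AI model: a crude RegEx correlator, for illustration only
    return 'refund' if re.search(r'\b(refund|money back)\b', text, re.I) else 'other'

def fuse(votes: list[str]) -> str:
    # back-end fusion algorithm: majority vote across the diverse models
    return Counter(votes).most_common(1)[0][0]

text = 'I want my money back!'
# regex model plus two hypothetical models' votes on the same input
votes = [regex_classifier(text), 'refund', 'other']
fuse(votes)
# → 'refund'
```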

1. If the transformation doesn’t include the class’ original name, then useful semantics in the class name would be unexploited by GPT-3.

I’d need an example of this one. But like I mentioned earlier, useful semantics from the classification could be restored by lookups (vector or direct) and seeded with entity extraction or other classifiers … all in the background, AI and non-AI running in parallel on the incoming data.

I just see sampling as an unnecessary workaround. There’s a potentially simpler approach which should be evaluated.

Yes, there are simpler approaches! And these are what I would use in the background in parallel. Then integrate the responses (via direct code, or AI, or both) into the final answer.


Hi @chicxulub. I’m also in need of an estimation capability from the GPT3 series. Have you figured out a means of reliably computing P(completion | prefix), for a user-specified completion and prefix?

Ah, I forgot to update this community! Yes, you now (I think as of at least a month ago) can set `max_tokens=0, logprobs=1, echo=True` and get the log-probabilities for each token in the input.

Here’s a minimal implementation in Python:

```python
import math
import os

import openai
import tiktoken

openai.api_key = os.getenv('OPENAI_API_KEY')
model = 'text-davinci-003'

prefix     = 'hey how'
completion = ' are ya'

response = openai.Completion.create(model=model,
                                    prompt=prefix + completion,
                                    max_tokens=0,
                                    logprobs=1,
                                    echo=True)

token_logprobs = response['choices'][0]['logprobs']['token_logprobs']

# post-process to get what we want
tokenizer = tiktoken.encoding_for_model(model)
num_completion_tokens = len(tokenizer.encode(completion))
# apply the probability chain rule:
# log Pr(are ya | hey how) = log Pr(are | hey how) + log Pr(ya | hey how are)
logprob_completion_given_prefix = sum(token_logprobs[-num_completion_tokens:])
prob_completion_given_prefix = math.exp(logprob_completion_given_prefix)
prob_completion_given_prefix
# avoid plugging this into other calculations, as it may underflow
```

(For fun) I’m working on a project which uses this functionality to do zero-shot text classification. Here’s the repo. An important difference is that I actually take a mean instead of a `sum`, since longer completions may trivially result in lower probabilities. I don’t want that for classification.
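A toy illustration of why the mean matters, with made-up per-token log-probabilities:

```python
# hypothetical per-token log-probs for two candidate completions
short_completion_logprobs = [-1.0, -1.2]              # 2 tokens
long_completion_logprobs  = [-0.9, -1.0, -0.8, -1.1]  # 4 tokens

# summing favors the shorter completion simply because it has fewer tokens
sum_short = sum(short_completion_logprobs)   # -2.2
sum_long  = sum(long_completion_logprobs)    # -3.8

# averaging removes that length penalty, and here flips the decision
mean_short = sum_short / len(short_completion_logprobs)   # -1.1
mean_long  = sum_long / len(long_completion_logprobs)     # -0.95
```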