Right, so create a small version of your own, with your own weights. And run it both ways to get insight. Don’t use GPT-3.
I ran zero-shot sampling/completion w/ GPT-3 curie (the second largest GPT-3) and got 3% accuracy on a very difficult classification task. I then ran the proposed method zero-shot w/ an open-source GPT-2 (technically, a GPT-2 which is half the size of the main GPT-2), and got 14% accuracy on the same task.
The experiment isn’t controlled b/c the models are different, but GPT-3 curie is purportedly much more capable than GPT-2. So this result makes it look like the proposed method is much better. But I’m certain that there’s no good reason to extrapolate the result all the way to GPT-3 text-davinci-003, which is additionally trained w/ humans in the loop. Sampling from text-davinci-003 is 60% accurate. So my only real question is how well the proposed method works on text-davinci-003 and davinci. Maybe it’s still 60% accurate, maybe it’s 65% accurate, who knows.
There are definitely a lot of variables floating around here.
When I hear “Zero Shot” and “Difficult Classification Task”, I immediately think of fine-tuning one of the base GPT-3 models to immensely improve the classification before going too much further. Do you think a good fine-tune on davinci would get it from 60% to 90%? Can you test this somehow? I know the models are black boxes, but you can still evaluate correctness from the limited output data.
Also, I am a bit confused: it seems like you want to alter the internals to get a better answer. How are you evaluating your new alternative on davinci without having access to the internals?
> How are you evaluating your new alternative on davinci without having access to the internals?
I only evaluated the proposed method on GPT-2 b/c it’s open-source. Next, I’d like to evaluate the alternative on davinci and text-davinci-003.
> Do you think a good fine-tune on davinci would get it from 60% to 90%? Can you test this somehow?
I’ll finetune davinci eventually, and I think it will significantly help both methods. But I’d prioritize comparing the proposed method vs sampling/completion in the zero-shot regime b/c:
- It’s way less work and money.
- Zero-shot classification is a big and bold benefit of large language models. If the proposed method consistently outperforms sampling on zero-shot classification, that’s an important result.
- While both methods should be assessed after finetuning (in addition to before), the impact of finetuning is not necessarily relevant to the question: what’s the best way to frame classification problems when we have a big, capable LM like GPT-3? Is it better to autoregressively sample from it, or to just do Bayes-optimal classification? Maybe the answer depends on whether training data is available, but right now I don’t see why that’d be.
For classification, what I do is have one output token and set the temperature to 0. For a highly fine-tuned base model, I assume this is close to Bayes-optimal classification, at least in terms of what the network understands. And you can get good results with the lower-end ada and babbage.
With the higher-end curie and davinci you can do the same, but it is my belief that they can achieve the same performance as the lower models with less data.
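A minimal sketch of that setup, using the completion endpoint (the fine-tuned model name and prompt are placeholders, and I’m assuming the classes were trained to be the single tokens ‘0’ and ‘1’):

```python
import openai

# hypothetical fine-tuned model; classes are the single tokens '0' and '1'
response = openai.Completion.create(
    model='ada:ft-your-org-2023-01-01-00-00-00',  # placeholder fine-tune name
    prompt='Review: great product, would buy again\nLabel:',
    max_tokens=1,    # exactly one output token = one class
    temperature=0,   # greedy decoding: take the most likely class token
)
label = response['choices'][0]['text'].strip()
```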
As for autoregressive sampling (or extrapolation), I would be wary of this for classification, but I’m probably not seeing exactly why extrapolating is a useful classification technique, so feel free to enlighten me.
Yup, to reiterate your points: the closest thing to Bayes-optimal classification using the completion endpoint is to:
- Transform or point to each class using a single token
- Set `max_tokens=1`
- Set `temperature=0`
- Set `logit_bias={class token id: log Pr(class)}`, where Pr(class) is estimated from training data (or guessed!)
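A sketch of that recipe in code (the classes, priors, and prompt are made up; `logit_bias` keys are stringified token ids, and I’m assuming each class phrase encodes to a single token):

```python
import math
import openai
import tiktoken

tokenizer = tiktoken.encoding_for_model('text-davinci-003')

# hypothetical single-token classes with guessed priors
class_priors = {' yes': 0.7, ' no': 0.3}
logit_bias = {str(tokenizer.encode(token)[0]): math.log(prior)
              for token, prior in class_priors.items()}

response = openai.Completion.create(
    model='text-davinci-003',
    prompt='Is this email spam?\n...\nAnswer:',
    max_tokens=1,
    temperature=0,
    logit_bias=logit_bias,  # nudge each class token by its log-prior
)
print(response['choices'][0]['text'])
```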
The problems with transforming a class to a single token are that:
- The transformation is not always a trivial prompt engineering task when the classes are meaningful phrases, or when there are a lot of classes.
- Even if it is trivial, the completion still is not guaranteed to be one of the single tokens used to represent classes. This forces the user to study degenerate completions and then implement ways to post-process them.
- If the transformation doesn’t include the class’ original name, then useful semantics in the class name would be unexploited by GPT-3.
I just see sampling as an unnecessary workaround. There’s a potentially simpler approach which should be evaluated.
> The transformation is not always a trivial prompt engineering task when the classes are meaningful phrases, or when there are a lot of classes.
I would avoid lots of classes coming out of one classifier, mainly because I want to maximize SNR. If you need lots of classes, create more classifiers and have each classifier handle a smaller set of classes.
Now, when the classes are meaningful phrases? I would avoid that too; maybe I’m not seeing the benefit of it. You could always map the single-token class back to a meaningful phrase through a lookup, either a straight lookup of some sort, or a correlated lookup like an embedding.
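For instance, a straight lookup could be as simple as this (the labels and phrases are invented):

```python
# invented single-token labels mapped back to the phrases they stand for
label_to_phrase = {
    'A': 'billing question',
    'B': 'technical support request',
    'C': 'feature request',
}
predicted = 'B'  # e.g. the single token the classifier returned
phrase = label_to_phrase.get(predicted, 'unknown')
```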
> Even if it is trivial, the completion still is not guaranteed to be one of the single tokens used to represent classes. This forces the user to study degenerate completions and then implement ways to post-process them.
You can use the `token_logprobs` to at least see how close the classification was to your token, and you can back off on any action if it’s not close enough.
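A sketch of that backoff check, assuming the completion request was made with `logprobs` set (the threshold is arbitrary):

```python
import math

top_logprob = -0.9   # stand-in: response['choices'][0]['logprobs']['token_logprobs'][0]
THRESHOLD = 0.6      # arbitrary minimum probability for the class token

if math.exp(top_logprob) < THRESHOLD:
    label = None     # not close enough: back off and take no action
```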
As for degenerate completions, you will always have to code for the corner cases coming out of these. A simple example: if your classifier expects ‘0’ or ‘1’ in the output, the fine-tuned GPT-3 model can output ’ 0’, ’ zero’, etc., and so you alias these back to ‘0’. You can even seed it with entity-extraction values from the original input (see below for running multiple models in parallel).
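A sketch of that aliasing for a ‘0’/‘1’ classifier (the alias table is illustrative, not exhaustive):

```python
# illustrative aliases for degenerate completions of a '0'/'1' classifier
ALIASES = {'0': '0', 'zero': '0', '1': '1', 'one': '1'}

def normalize(completion):
    # strip whitespace and lowercase, then alias back to a canonical class
    return ALIASES.get(completion.strip().lower())  # None if unrecognized

assert normalize(' 0') == '0'
assert normalize(' zero') == '0'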
In the case of bad classifications, this is where multiple models come in. You run a variety of diverse models on the same input, and you make a decision based on the entirety of the output. These models can even be non-AI based, such as RegEx correlators. You just need an algorithm on the back end to fuse this information into a final result.
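A sketch of that fusion, with a GPT-based classifier and a non-AI RegEx correlator voting on the same input (every component here is a stand-in):

```python
import re
from collections import Counter

def gpt_classifier(text):
    return '1'  # stand-in for a fine-tuned GPT-3 call returning '0'/'1'

def regex_correlator(text):
    # non-AI correlator: flag anything that mentions a refund
    return '1' if re.search(r'\brefund\b', text, re.I) else '0'

def fuse(text, classifiers):
    # back-end fusion algorithm: here, a simple majority vote
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]

print(fuse('I want a refund', [gpt_classifier, regex_correlator]))
```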
> If the transformation doesn’t include the class’ original name, then useful semantics in the class name would be unexploited by GPT-3.
I’d need an example of this one. But like I mentioned earlier, useful semantics from the classification could be restored by lookups (vector or direct) and seeded with entity extraction or other classifiers … all in the background, with AI and non-AI models running in parallel on the incoming data.
> I just see sampling as an unnecessary workaround. There’s a potentially simpler approach which should be evaluated.
Yes, there are simpler approaches! And these are what I would use in the background in parallel. Then integrate the responses (via direct code, or AI, or both) into the final answer.
Hi @chicxulub. I’m also in need of an estimation capability from the GPT-3 series. Have you figured out a means of reliably computing P(completion | prefix) for a user-specified completion and prefix?
Ah, I forgot to update this community! Yes, you now (I think as of at least a month ago) can set `max_tokens=0`, `logprobs=1`, `echo=True` and get the log-probabilities for each token in the input.
Here’s a minimal implementation in Python:

```python
import math
import os

import openai
import tiktoken

openai.api_key = os.getenv('OPENAI_API_KEY')

model = 'text-ada-001'
prefix = 'hey how'
completion = ' are ya'  # leading space so its tokens match those in the full prompt

response = openai.Completion.create(model=model,
                                    prompt=prefix + completion,
                                    max_tokens=0,  # generate nothing new
                                    logprobs=1,
                                    echo=True)     # return logprobs for the prompt itself
token_logprobs = response['choices'][0]['logprobs']['token_logprobs']

# post-process to get what we want
tokenizer = tiktoken.encoding_for_model(model)
num_completion_tokens = len(tokenizer.encode(completion))

# apply the probability chain rule:
# log Pr(are ya | hey how) = log Pr(ya | hey how are) + log Pr(are | hey how)
logprob_completion_given_prefix = sum(token_logprobs[-num_completion_tokens:])
prob_completion_given_prefix = math.exp(logprob_completion_given_prefix)
print(prob_completion_given_prefix)

# avoid plugging prob_completion_given_prefix into other calculations,
# as it may underflow; use logprob_completion_given_prefix instead
```
(for fun) I’m working on a project which uses this functionality to do zero-shot text classification. Here’s the repo. An important difference is that I actually take a mean instead of a sum, since longer completions may trivially result in lower probabilities. I don’t want that for classification.
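In code, that’s a one-line change to the snippet above:

```python
# length-normalized variant: average, rather than sum, the per-token
# log-probabilities so longer completions aren't trivially penalized
avg_logprob = sum(token_logprobs[-num_completion_tokens:]) / num_completion_tokens
```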
Feel free to install the package, or copy-paste whatever code you need w/o credit