Introducing CAPPr: a package to easily perform text classification using OpenAI models

What is CAPPr?

CAPPr is a Python package which performs zero-shot text classification by estimating the probability that an inputted completion comes after an inputted prompt. CAPPr = “Completion After Prompt Probability”.





The standard zero-shot classification method using language models is to generate a completion given a prompt. For example, if you’re classifying animals, you’d have the model generate text after

The biological class of a blue whale is

and then hope the output is Mammalia.

Sampling usually works well. The problem is that the string you get could be any plausible completion, not necessarily one in your list of classes. So you’ll have to write custom post-processing code for every new classification task you solve.
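
To illustrate the kind of post-processing that sampling tends to require, here is a minimal sketch. The `match_class` helper and the class list are hypothetical, made up for this example, and real tasks usually need more careful matching than a substring check.

```python
from typing import Optional, Sequence

# Hypothetical list of classes for the animal example
CLASS_NAMES = ("Mammalia", "Aves", "Reptilia")


def match_class(raw_completion: str, class_names: Sequence[str]) -> Optional[str]:
    """Map a sampled completion to a class name, or None if nothing matches."""
    cleaned = raw_completion.strip().strip(".").lower()
    for name in class_names:
        if name.lower() in cleaned:
            return name
    return None  # the sampled text was plausible, but not one of our classes


print(match_class(" Mammalia.", CLASS_NAMES))  # Mammalia
print(match_class("a mammal", CLASS_NAMES))    # None
```

The second call shows the failure mode: a perfectly reasonable completion that still can't be mapped to a class without more task-specific rules.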

CAPPr addresses this problem by reframing the task as a series of simple computations. You are then guaranteed to get a completion from the list of classes which you inputted, which eliminates the need to write any custom post-processing code.
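
To make the idea concrete, here is a minimal sketch of scoring a fixed set of completions instead of sampling one. This is not CAPPr's actual implementation or API. The `token_logprobs` numbers are invented for illustration (in practice they would come from a language model), and averaging token log-probabilities before normalizing is an assumption about how per-class scores could be formed.

```python
import math

# Hypothetical log-probabilities of each completion's tokens, conditional on
# the prompt "The biological class of a blue whale is". These numbers are
# made up; a real system would get them from a language model.
token_logprobs = {
    "Mammalia": [-0.2, -0.1],
    "Aves": [-4.0, -2.5],
    "Reptilia": [-5.0, -1.5, -0.5],
}


def completion_probs(logprobs_by_class):
    """Average each completion's token log-probs, exponentiate, then normalize."""
    avg = {name: sum(lps) / len(lps) for name, lps in logprobs_by_class.items()}
    unnormalized = {name: math.exp(a) for name, a in avg.items()}
    total = sum(unnormalized.values())
    return {name: u / total for name, u in unnormalized.items()}


probs = completion_probs(token_logprobs)
prediction = max(probs, key=probs.get)  # always one of the given classes
```

Because the prediction is an argmax over the classes you passed in, there is nothing to parse afterwards.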

Read the motivation page of CAPPr’s documentation if you’re more curious.

Is it good?

I’m still trying to find out lol. I’ve evaluated CAPPr on a grand total of 2 datasets and a handful of examples. So if you’re interested in using the cappr package, make sure to carefully evaluate it :slight_smile:

One interesting result is that on the Choice of Plausible Alternatives task, the zero-shot text-curie-001 model is < 50% accurate when using sampling, but 80% accurate when using CAPPr. (Here’s a link to the experiment notebook.) It would be cool to demonstrate that CAPPr squeezes more out of smaller or less-heavily trained LLMs, as CAPPr’s performance may be based more on next-token prediction performance than instruction-based performance.

Feel free to install it and mess around, I’d be happy to hear what you think!


Congrats. Thanks for sharing. Good luck with the project!


It looks like our discussion on the forum over HERE paid off!

I’ll have to take a look at it.


@curt.kennedy Absolutely–thank you for thoroughly discussing the ideas there!

This reminds me that I should update that topic w/ a link to this one.


Congratulations @chicxulub

I have a question. Why not do zero shot classifications using embeddings?


Another approach to classification that I’ve been considering is using embeddings as a front-end, and a feed-forward neural network as the back-end. So the vector coefficients feed the initial input layer, and the output is a single float between 0 and 1, where 1 is in-class, 0 is out-of-class.

Also, since most embedding spaces are non-isotropic, I could even reduce the dimensions of the embedding vector and get away with a much smaller neural network. This would enable me to run thousands of classifications on-the-fly, with cheap hardware resources too. Then post-process the output of this classification bank to characterize what really hit the front-end.

So one embedding vector leads to thousands of classifications!
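
A rough sketch of that pipeline, under several assumptions: the embeddings are synthetic random vectors rather than real ones, the dimension reduction is plain PCA via SVD, and the bank of classifiers shares one hidden layer (a simplification of having thousands of separate small networks). The weights are random and untrained, so this only shows the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for embedding vectors (not real model output)
embeddings = rng.normal(size=(200, 64))


def fit_pca(X: np.ndarray, n_components: int):
    """Return the mean of X and its top principal directions, via SVD."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def forward(X_reduced, W1, b1, W2, b2) -> np.ndarray:
    """One ReLU hidden layer, then one sigmoid output per classifier."""
    hidden = np.maximum(X_reduced @ W1 + b1, 0.0)
    return sigmoid(hidden @ W2 + b2)


mean, components = fit_pca(embeddings, n_components=8)
reduced = (embeddings - mean) @ components.T  # shape (200, 8)

# Untrained, randomly initialized weights (illustration only)
n_classifiers = 1000
W1, b1 = rng.normal(scale=0.1, size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, n_classifiers)), np.zeros(n_classifiers)

# One row per embedding, one column per classifier; 1 ~ in-class, 0 ~ out-of-class
scores = forward(reduced, W1, b1, W2, b2)  # shape (200, 1000)
```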


@sps Good question. Embeddings may perform well for simpler tasks, and they are much cheaper. So they should be evaluated before other methods. The biggest problem is that they are unlikely to perform well for slightly more complex language tasks.

One fundamental reason for this is that any model which needs to do well at cosine similarity calculations must be explicitly trained to do well at them! Here is a reference for BERT, another popular LM:

Reimers, Nils, and Iryna Gurevych. “Sentence-bert: Sentence embeddings using siamese bert-networks.” arXiv preprint arXiv:1908.10084 (2019).

In other words, the token embeddings that are so powerful for next-token prediction have to be refined or transferred to do cosine similarity. And if there isn’t enough high-quality data to train a cosine similarity model, then training even the most powerful LM is unlikely to yield results comparable to next-token prediction. For next-token prediction, the internet is a huge and high-quality dataset. There isn’t a comparable dataset to train cosine similarity models. An anecdote: at my last job, training BERT to do well at cosine similarity on our big and high-quality dataset resulted in a model which performed worse than zero-shot GPT-3.5!

Enough talk though. Let’s put this idea to the test by having text-embedding-ada-002 do that same product review classification problem in CAPPr’s motivation page.

Run this code in a Python environment w/ openai and numpy installed.

import os

import numpy as np
import openai

openai.api_key = os.getenv('OPENAI_API_KEY')

EMBEDDING_MODEL = "text-embedding-ada-002"

# Classification problem
class_names = ('The product is too expensive',
               'The product uses low quality materials',
               'The product is difficult to use',
               "The product isn't working",
               "The product doesn't look good",
               'The product is great')

product_reviews = ["I can't figure out how to integrate it into my setup."]

# We want a model to predict 'The product is difficult to use'. That's clearly
# the most similar class to the product review 

# Get embeddings (in batches!)
# Note: openai.Embedding is the interface from openai-python versions before 1.0
_resp = openai.Embedding.create(model=EMBEDDING_MODEL,
                                input=list(class_names))
embeddings_class_names = np.array([out['embedding'] for out in _resp['data']])

_resp = openai.Embedding.create(model=EMBEDDING_MODEL,
                                input=product_reviews)
embeddings_texts = np.array([out['embedding'] for out in _resp['data']])

# Let's verify that embeddings are already normalized. That would mean we just
# have to take the dot product to get the cosine similarity.
def is_normalized(embeddings: np.ndarray) -> bool:
    product = embeddings @ embeddings.T
    return np.allclose(np.diag(product), 1)

assert is_normalized(embeddings_class_names)
assert is_normalized(embeddings_texts)

cosine_similarities = embeddings_texts @ embeddings_class_names.T
# array([[0.752, 0.726, 0.794, 0.808, 0.762, 0.748]])

# From this array, we can already see that the 4th class is considered to be the
# most similar to the product review (in embedding space):
pred_class_idxs = cosine_similarities.argmax(axis=1)
[class_names[pred_class_idx] for pred_class_idx in pred_class_idxs]
# ["The product isn't working"]

There are also errors in that OpenAI notebook you linked. The way it computes probas is wrong: for one thing, it won’t work if there are more than 2 classes! It’s also definitely not a probability, as the label_score function can produce negative values. Finally, the scores don’t form a probability distribution. scikit-learn’s PrecisionRecallDisplay.from_predictions hides these errors because precision and recall calculations don’t actually need probabilities, they just need arbitrary scores. The plot could’ve been produced by feeding in raw cosine similarities to the 'positive' class.

The probability calculation should be replaced with something simple and standard like this (extending my code example from above):

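def softmax(similarities: np.ndarray) -> np.ndarray:
    # Normalize each row separately so that this also works for multiple reviews
    exp = np.exp(similarities)
    return exp / np.sum(exp, axis=1, keepdims=True)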

pred_probs = softmax(cosine_similarities)
# array([[0.165, 0.16 , 0.171, 0.174, 0.166, 0.164]])

# To drive home the point that these are quite undiscriminative, let's see what a
# uniform distribution over the classes looks like, i.e., what probabilities would
# a random guesser produce?
(np.ones(len(class_names)) / len(class_names)).round(3)
# array([0.167, 0.167, 0.167, 0.167, 0.167, 0.167])

A final remark: I wish sentiment classification was not the de-facto demo for text classification. So many models can do well on sentiment.



A couple of comments on your code above. To me, it looked like the embedding technique worked! If I didn’t know better, I would think that “I can’t figure out how to integrate it into my setup.” maps to “The product isn’t working”. This is a plausible result, because “The product is difficult to use” presupposes that the product has already been integrated into their setup and they just can’t use it, which isn’t the case if they can’t integrate it in the first place.

As for the correlations being close to the random guesser: this is an artifact of ada-002 being non-isotropic (you don’t see dot products, i.e., cosine similarities, below 0.7; they all get compressed between 0.7 and 1.0). The energy is focused in a specific direction in the embedding space, and you need to post-process the embeddings to force more spatial spread.

The consequence, if you don’t do this, is that correlated and uncorrelated pairs score close to one another. This isn’t the end of the world though, since you only need a factor of 10 or so more precision in the embeddings to overcome this, and the 32- or 64-bit floating point numbers used here easily accommodate that.
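
Here is a small sketch of what such post-processing could look like. It uses mean-centering, which is just one simple option assumed for illustration, and the embeddings are synthetic (a large shared offset plus noise) rather than real ada-002 vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic embeddings: a large shared offset plus small per-item noise,
# mimicking vectors that all point in roughly the same direction.
offset = rng.normal(size=32) * 5.0
raw = offset + rng.normal(size=(50, 32))


def normalize(X: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def mean_offdiag_cosine(X: np.ndarray) -> float:
    """Average cosine similarity between distinct rows of X."""
    unit = normalize(X)
    sims = unit @ unit.T
    n = len(X)
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


centered = raw - raw.mean(axis=0)  # remove the shared direction

before = mean_offdiag_cosine(raw)      # high: everything looks similar
after = mean_offdiag_cosine(centered)  # much lower: similarities spread out
```

In this toy setup, unrelated items go from looking nearly identical to being roughly orthogonal on average. With real embeddings the mean would have to be estimated from a representative sample of your own data.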


I agree that the example isn’t that convincing. Though, having worked somewhat extensively with customers on exactly this type of data, I’m confident that the majority strongly want the product review classified as “The product is difficult to use”. For what it’s worth, text-davinci-003 and ChatGPT will confidently select that as the best summary of the product review. We can agree to disagree. Ideally there’s an actual benchmark to compare zero-shot embeddings vs. zero-shot sampling or CAPPr.

Great point about the embeddings. But I think the problem is more fundamental than numerical precision. The problem is that it’s nontrivial to do calibrated probability estimation using cosine similarities. Unless there’s a study I haven’t seen which says a factor of 10 is a good rule of thumb, you need training data specific to your task in order to estimate that up-scaling parameter.
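
To show what that up-scaling parameter does, here is a sketch that applies a softmax to scaled cosine similarities, using the similarities from the product review example above. The `scale` values are arbitrary, hypothetical choices and are not calibrated to anything.

```python
import math

# Cosine similarities from the product review example above
similarities = [0.752, 0.726, 0.794, 0.808, 0.762, 0.748]


def scaled_softmax(values, scale):
    """Softmax of scale * values. A larger (positive) scale is sharper."""
    top = max(values)
    exps = [math.exp(scale * (v - top)) for v in values]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


flat = scaled_softmax(similarities, scale=1.0)      # close to uniform
sharper = scaled_softmax(similarities, scale=50.0)  # hypothetical scale value
```

The predicted class is the same for any positive scale. Only the confidence changes, which is why the scale can't be chosen sensibly without labeled data for the task.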

Thanks for this discussion though. I’ll add to the documentation that zero-shot embeddings should be attempted first if the language task isn’t too difficult, can be framed as a similarity problem, and probabilities are not needed.


I believe you, embeddings are not a perfect end-to-end solution, especially when lots of classes are involved, and when these classes have very fine distinctions between them.

But yes, I do think you can use embeddings as a pre-filter, and then for whatever categories the correlated embeddings point to, you can go with your trained classifiers to distinguish further and hopefully get a much more accurate classification, after the embedding pointed the way initially.
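
As a sketch of the pre-filter step, here is a top-k selection over the cosine similarities from the product review example earlier in the thread. The `top_k_classes` helper is hypothetical, and the second-stage classifier that would choose among the candidates is not shown.

```python
class_names = ('The product is too expensive',
               'The product uses low quality materials',
               'The product is difficult to use',
               "The product isn't working",
               "The product doesn't look good",
               'The product is great')

# Cosine similarities between the review and each class, from the example above
similarities = [0.752, 0.726, 0.794, 0.808, 0.762, 0.748]


def top_k_classes(similarities, class_names, k):
    """Return the k class names with the highest similarity, best first."""
    ranked = sorted(range(len(class_names)),
                    key=lambda i: similarities[i],
                    reverse=True)
    return [class_names[i] for i in ranked[:k]]


candidates = top_k_classes(similarities, class_names, k=2)
# ["The product isn't working", 'The product is difficult to use']
```

With k=2, both of the classes debated above survive the pre-filter, so a more accurate second stage would only need to distinguish between those two.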

The factor of 10 was more a remark on the compression factor in ada-002, not to be used as a probability metric, although it could be weakly considered one if you linearized it to degrees or radians and had a threshold empirically derived based on previous data. Or you could post-fit, drop dimensions, and make the embeddings more isotropic (increasing the angle between different things). But this involves a post-fit that will likely drift over time, making it another annoying chore in the DevOps pipeline.

I’m still trying to wrap my head around CAPPr, and the whole prob(Completion | Input) concept. Convince me why it’s awesome! The only downside (maybe) is that you need a GPU to run it. Also, I thought OpenAI is dropping log probs in the latest models (GPT-4), not sure.
