@sps Good question. Embeddings may perform well for simpler tasks, and they are much cheaper. So they should be evaluated before other methods. The biggest problem is that they are unlikely to perform well for slightly more complex language tasks.
One fundamental reason for this is that any model which needs to do well at cosine similarity calculations must be explicitly trained to do well at them! Here is a reference for BERT, another popular LM:
Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084 (2019).
In other words, those token embeddings which are super powerful for next-token prediction have to be refined or transferred to do cosine similarity. And if there isn't enough high-quality data to train a cosine similarity model, then training even the most powerful LM is unlikely to yield results comparable to next-token prediction. For next-token prediction, the internet is a huge, high-quality dataset. There isn't a comparable dataset for training cosine similarity models. An anecdote: at my last job, training BERT to do well at cosine similarity on our big, high-quality dataset resulted in a model which performed worse than zero-shot GPT-3.5!
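To make "explicitly trained" concrete, here is a minimal sketch of what that kind of fine-tuning typically looks like with the sentence-transformers library (the reference implementation from the Sentence-BERT paper above). The checkpoint name and the labeled pairs below are made up for illustration; real training needs a lot of labeled pairs.

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Wrap a plain BERT checkpoint with mean pooling to get sentence embeddings
model = SentenceTransformer('bert-base-uncased')

# Pairs of texts labeled with how cosine-similar their embeddings should be
train_examples = [
    InputExample(texts=["I can't figure out how to set it up",
                        'The product is difficult to use'], label=0.9),
    InputExample(texts=["I can't figure out how to set it up",
                        'The product is great'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# This loss optimizes the embeddings so that their cosine similarity matches
# the labels. That's the explicit training referred to above.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1,
          warmup_steps=10)

The details aren't the point. The point is that this supervised step, and the labeled similarity data it needs, has to happen before cosine similarities on the embeddings mean much for your task.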
Enough talk though. Let's put this idea to the test by having text-embedding-ada-002 do the same product review classification problem from CAPPr's motivation page. Run this code in a Python environment w/ openai and numpy installed.
import os

import numpy as np
import openai

openai.api_key = os.getenv('OPENAI_API_KEY')
EMBEDDING_MODEL = "text-embedding-ada-002"

# Classification problem
class_names = ('The product is too expensive',
               'The product uses low quality materials',
               'The product is difficult to use',
               "The product isn't working",
               "The product doesn't look good",
               'The product is great')
product_reviews = ["I can't figure out how to integrate it into my setup."]
# We want a model to predict 'The product is difficult to use'. That's clearly
# the most similar class to the product review

# Get embeddings (in batches!)
_resp = openai.Embedding.create(model=EMBEDDING_MODEL,
                                input=class_names)
embeddings_class_names = np.array([out['embedding'] for out in _resp['data']])

_resp = openai.Embedding.create(model=EMBEDDING_MODEL,
                                input=product_reviews)
embeddings_texts = np.array([out['embedding'] for out in _resp['data']])

# Let's verify that embeddings are already normalized. That would mean we just
# have to take the dot product to get the cosine similarity.
def is_normalized(embeddings: np.ndarray) -> bool:
    product = embeddings @ embeddings.T
    return np.allclose(np.diag(product), 1)

assert is_normalized(embeddings_class_names)
assert is_normalized(embeddings_texts)

cosine_similarities = embeddings_texts @ embeddings_class_names.T
cosine_similarities.round(3)
# array([[0.752, 0.726, 0.794, 0.808, 0.762, 0.748]])

# From this array, we can already see that the 4th class is considered to be the
# most similar to the product review (in embedding space):
pred_class_idxs = cosine_similarities.argmax(axis=1)
[class_names[pred_class_idx] for pred_class_idx in pred_class_idxs]
# ["The product isn't working"]
There are also errors in that OpenAI notebook you linked. The way it computes probas is wrong, immediately so because it won't work if there are more than 2 classes. It's also definitively not a probability, as the label_score function will produce negative values. And finally, the scores across classes don't form a probability distribution. scikit-learn's PrecisionRecallDisplay.from_predictions hides these errors because precision and recall calculations don't actually need probabilities; they just need arbitrary scores. The plot could've been produced by feeding in the raw cosine similarities to the 'positive' class.
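To see that last point concretely, here's a small check (reusing numpy from above, with made-up labels and scores rather than the notebook's data) that scikit-learn's precision-recall computation only depends on how the scores rank the examples, not on whether they're probabilities:

from sklearn.metrics import precision_recall_curve

y_true = [0, 1, 1, 0, 1, 0]
scores = np.array([0.71, 0.80, 0.77, 0.73, 0.82, 0.69])  # e.g. raw cosine similarities

precision_1, recall_1, _ = precision_recall_curve(y_true, scores)
# Any strictly increasing transform of the scores (this one produces negative
# and >1 values, so certainly not probabilities) gives the exact same curve
precision_2, recall_2, _ = precision_recall_curve(y_true, 10 * scores - 7)

assert np.allclose(precision_1, precision_2)
assert np.allclose(recall_1, recall_2)

Because any order-preserving transformation of the scores yields the same curve, the notebook's plot can look reasonable even though its "probabilities" aren't probabilities.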
The probability calculation should be replaced with something simple and standard like this (extending my code example from above):
def softmax(similarities: np.ndarray) -> np.ndarray:
    # Shift by the max for numerical stability (doesn't change the result),
    # and normalize each row so that it sums to 1
    exp = np.exp(similarities - similarities.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

pred_probs = softmax(cosine_similarities)
pred_probs.round(3)
# array([[0.165, 0.16 , 0.171, 0.174, 0.166, 0.164]])
# To drive home the point that these are quite undiscriminative, let's see what a
# uniform distribution over the classes looks like, i.e., what probabilities would
# a random guesser produce?
(np.ones(len(class_names)) / len(class_names)).round(3)
# array([0.167, 0.167, 0.167, 0.167, 0.167, 0.167])
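For intuition about why the distribution is so flat (just arithmetic on the similarity values printed above, so the exact numbers below are approximate): the ada-002 cosine similarities for these classes all land in a narrow band, roughly 0.73 to 0.81, and softmax only sees the differences between them.

# The raw similarities above only span about 0.08 ...
spread = cosine_similarities.max() - cosine_similarities.min()
spread.round(3)
# ~0.082 (from the values printed above)

# ... so the most and least similar classes differ by at most a factor of
# exp(spread) in predicted probability
np.exp(spread).round(3)
# ~1.085, i.e., less than 9% apart

Rescaling the similarities before the softmax would spread these numbers out, but it can't change which class wins.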
A final remark: I wish sentiment classification weren't the de facto demo for text classification. So many models can already do well on sentiment.