I don’t know if this is widely known or not, but it turns out you can do pretty good text classification using embeddings alone.
The bare minimum code looks something like this, where the ‘Target’ column has a few True values (my positive examples: the kind of text I want to classify as True).
import pandas as pd
import numpy as np
df = pd.read_csv("messages.csv")  # 8,000 rows; columns: Message, Target
embeddings = np.load("embeddings.npy")  # shape (8000, 3072), one row per message
# Sum the embeddings of the True rows, then score every row by its dot product with that sum
df["Score"] = embeddings @ (embeddings.T @ df.Target)  # No ML required
In this example, the data is all of my ChatGPT messages, and I’m trying to pick out the ones where I was not pleased with the response.
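To make it clearer what that scoring line computes, here is a minimal sketch on made-up data. The random 4-dimensional vectors, the unit-length normalisation, and the names `positive_sum` and `similarity` are assumptions for illustration only, not part of the original code. It checks that the one-liner is the same as summing the positive embeddings and taking a dot product with every row, which (for unit-length rows) is each row's total cosine similarity to the positive examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 6 "messages" with 4-dimensional embeddings.
embeddings = rng.normal(size=(6, 4))
# Normalise rows to unit length so dot products are cosine similarities (assumption).
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Two positive examples, like the True values in the 'Target' column.
target = np.array([True, False, False, True, False, False])

# The one-liner from the post.
scores = embeddings @ (embeddings.T @ target)

# Step by step: add up the positive rows, then dot every row with that sum.
positive_sum = embeddings[target].sum(axis=0)
scores_explicit = embeddings @ positive_sum

# Equivalent view: sum each row's similarity to every positive row.
similarity = embeddings @ embeddings.T
scores_from_similarity = similarity[:, target].sum(axis=1)

print(np.allclose(scores, scores_explicit))
print(np.allclose(scores, scores_from_similarity))
```

Computing `embeddings.T @ target` first keeps the work small: it never builds the full row-by-row similarity matrix.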
With only three positive examples, it does a pretty good job of surfacing such messages when the rows are sorted by the ‘Score’ column from the code above.
I wrote a whole blog post about it (that explains how to get from sorting to classification, among other things): “Text classification with vector embeddings — and no ML model” by David Gilbertson (Medium, Jan 2025).
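As a rough illustration of going from a sorted list to a yes/no label, one simple possibility is to pick a cutoff on the score. This is an assumption for the sake of a sketch, not necessarily the approach the blog post takes; the function name `classify_by_threshold`, the scores, and the threshold value are all made up.

```python
import numpy as np

def classify_by_threshold(scores, threshold):
    """Label a row True when its score is at or above the cutoff (hypothetical helper)."""
    return np.asarray(scores) >= threshold

# Made-up scores standing in for the 'Score' column.
scores = np.array([1.8, 0.2, 0.9, 1.1, -0.3])

# Sorting: row indices from highest score to lowest.
ranked = np.argsort(-scores)

# Classification: everything at or above the cutoff counts as a positive.
predicted = classify_by_threshold(scores, threshold=1.0)

print(ranked.tolist())     # [0, 3, 2, 1, 4]
print(predicted.tolist())  # [True, False, False, True, False]
```

In practice the cutoff would have to be chosen by looking at where the sorted results stop being relevant.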