Text classification with vector embeddings — and no ML model

davidg707 · February 2, 2025, 4:28am

I don’t know if this is widely known or not, but it turns out you can do pretty good text classification using embeddings alone.

The bare minimum code is something like the below, where I have a ‘Target’ column with a few True values (my positive examples - the type of texts I want to classify as True).

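import pandas as pd
import numpy as np

df = pd.read_csv("messages.csv")  # 8,000 rows, Message and Target columns
embeddings = np.load("embeddings.npy")  # 8,000 rows x 3,072 dimensions

# Target as 1.0/0.0; blank cells count as not-positive
target = df["Target"].eq(True).to_numpy(dtype=float)

# Sum the positive rows' embeddings, then dot every row with that sum
df["Score"] = embeddings @ (embeddings.T @ target)  # No ML required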

The above example code is for all my ChatGPT messages, and I’m trying to classify the ones where I was not pleased with the response.
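To make the one-liner less magical, here is a minimal sketch that unpacks it on synthetic data. The tiny random "embeddings" and the `target` vector are stand-ins invented for illustration, not the real 8,000 x 3,072 array. It shows that the score of each row is simply the sum of its dot products with the positive examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: 6 synthetic unit-length "embeddings" of dimension 4,
# with rows 0 and 3 marked as the positive examples.
embeddings = rng.normal(size=(6, 4))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
target = np.array([1, 0, 0, 1, 0, 0], dtype=float)

# Step 1: embeddings.T @ target adds up the embeddings of the positive rows.
positive_sum = embeddings.T @ target
assert np.allclose(positive_sum, embeddings[target == 1].sum(axis=0))

# Step 2: dot every row with that summed vector to get its score.
score = embeddings @ positive_sum

# Equivalent view: sum each row's similarities to the positive rows only.
pairwise = embeddings @ embeddings.T
score_alt = pairwise[:, target == 1].sum(axis=1)
assert np.allclose(score, score_alt)
```

One thing to keep in mind: if the embeddings are unit length, each positive example's own score includes a dot product of 1 with itself, so the known positives will tend to sit near the top of the sorted list regardless.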

With only three positive examples, it does a pretty good job of surfacing such messages: sorting by the ‘Score’ column from the code above puts them near the top.

I wrote a whole blog post about it (that explains how to get from sorting to classification, among other things). Text classification with vector embeddings — and no ML model | by David Gilbertson | Jan, 2025 | Medium
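As a rough illustration of going from a sorted list to a yes/no classifier, one option is to hand-label a small sample of scored rows and pick the score cutoff that maximizes F1 on that sample. This is an assumption for the sake of the example and not necessarily the approach the blog post takes. The helper `best_threshold` and the sample scores and labels are hypothetical.

```python
import numpy as np

def best_threshold(scores, labels):
    """Return (cutoff, f1) for the observed score that maximizes F1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_cut, best_f1 = None, -1.0
    for cut in np.unique(scores):
        predicted = scores >= cut  # classify as True at or above the cutoff
        tp = np.sum(predicted & labels)
        fp = np.sum(predicted & ~labels)
        fn = np.sum(~predicted & labels)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_cut, best_f1 = float(cut), float(f1)
    return best_cut, best_f1

# Hypothetical hand-reviewed sample: scores and whether each row was truly positive.
review_scores = [0.91, 0.85, 0.80, 0.42, 0.40, 0.13]
review_labels = [True, True, False, True, False, False]

cutoff, f1 = best_threshold(review_scores, review_labels)
```

Rows would then be classified with something like `df["Score"] >= cutoff`. F1 is only one possible criterion; if false positives are costly, a stricter cutoff may be preferable.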

curt.kennedy · February 7, 2025, 1:34am

Yes, and it works with images using image embedding models too.

And when you increase the SNR by weighting with correlation values across multiple hits … LOOK OUT!

It is one of those hidden gems.

This works for labels that are linearly related to the embeddings, and it gives a confidence factor for the label (sigma).
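One possible reading of "weighting with correlation values across multiple hits" is a similarity-weighted vote over the nearest labeled examples, with the weighted spread of their labels serving as the sigma. The sketch below is an interpretation under that assumption, not a confirmed description of the technique. The function `weighted_label`, the parameter `k`, and the toy vectors are all hypothetical.

```python
import numpy as np

def weighted_label(query, labeled_embeddings, labels, k=3):
    """Similarity-weighted mean and standard deviation of the k nearest labels."""
    sims = labeled_embeddings @ query        # cosine similarity for unit vectors
    top = np.argsort(sims)[::-1][:k]         # indices of the k most similar rows
    weights = np.clip(sims[top], 0.0, None)  # ignore negatively correlated hits
    if weights.sum() == 0:
        raise ValueError("no positively correlated hits for this query")
    weights = weights / weights.sum()
    mean = float(weights @ labels[top])
    sigma = float(np.sqrt(weights @ (labels[top] - mean) ** 2))
    return mean, sigma

# Toy 2-D unit vectors with numeric labels (1.0 = positive, 0.0 = negative).
labeled_embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
labels = np.array([1.0, 1.0, 0.0, 0.0])

mean_a, sigma_a = weighted_label(np.array([1.0, 0.0]), labeled_embeddings, labels)
mean_b, sigma_b = weighted_label(np.array([0.0, 1.0]), labeled_embeddings, labels)
```

In this toy case the first query sits among positives only, so its hits agree and the spread is zero. The second query's hits disagree, which shows up as a larger sigma and signals a less confident label.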
