Ada Embeddings with Fast Clustering

Hello,

I’m trying to cluster Ada embeddings using Fast Clustering, but can’t make it work.

I embedded only 9 paragraphs by doing:
features_tensor = torch.tensor(np.vstack(df.embedding.values))

The resulting shape is pretty wide: torch.Size([9, 1536])

And then I try to cluster by doing:
clusters = util.community_detection(features_tensor, min_community_size=2, threshold=0.5)

The code runs indefinitely.

The machine I’m running this on is below average… an Intel i5-3337U with 8 GB of RAM, but I don’t know if it’s a hardware and/or code issue.

My question is: Is the above code the best approach to using Fast Clustering? Should I reduce the dimensionality of ‘features_tensor’ by employing UMAP or PCA?

Any help is really appreciated!

1 Like

My guess is that it’s your “threshold=0.5” parameter. Everything coming out of ada-002 is unit-length, and the pairwise dot products (which equal the cosine similarities) all sit within a band of about 0.3, roughly 0.7 and up. So everything is going to get clustered together with that setting.

But you say “pretty wide”: if you are looking at the 1536, that is the vector dimension. If you are talking about the 9, that is probably your threshold putting everything in the same bucket.
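
If it helps, here’s a minimal sketch of what I mean: push the threshold up into the band where ada-002 similarities actually live. The 0.75 below is just a starting point to tune, not a magic number, and features_tensor is the tensor from your post.

from sentence_transformers import util

# features_tensor is the (9, 1536) tensor from the original post;
# with ada-002, similarities sit in a narrow high band, so start the threshold near the top of it
clusters = util.community_detection(features_tensor, min_community_size=2, threshold=0.75)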

1 Like

Hey, Curt!

Thanks for your reply.

I changed the code to “threshold=0.3”, “threshold=0.2”, “min_community_size=1”, but it keeps running indefinitely.

I’m about to give up on Fast Clustering and give BERTopic a try…

My goal is just to cluster the embeddings, without setting the number of clusters in advance, and run some analyses from there.

Best

2 Likes

Hello Alvaro,
I’m also trying to cluster embeddings without setting a number of clusters in advance. Have you found a method that works well for you? I’m looking to do this in python.

2 Likes

@cori

Generally, you’re not going to get what you expect when clustering in very high dimensions.

Unless you have absurdly huge (read actually impossibly large) numbers of data points, clustering will be almost entirely meaningless.

With 1536 dimensions, if we break the space up into “quadrants” so to speak, based on the sign of each vector element, you’ve got 2^1536 bins each vector could land in. That’s 2.41 × 10^462 directions the vector could point in. The vast majority of those will be empty.
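
If you want to sanity-check that figure, it only takes a couple of lines of Python:

import math

# count of sign-based "quadrants" (orthants) in 1536 dimensions
dims = 1536
print(dims * math.log10(2))   # ~462.38, i.e. 2**1536 is roughly 2.41e462
print(len(str(2 ** dims)))    # 463 digits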

Thinking in massively huge dimensions breaks the brain and it becomes increasingly difficult to interpret what is meant by proximity in those dimensions.

So, while it’s certainly possible (if computationally expensive) to cluster vectors in this many dimensions, I’m going to recommend against it unless you have a very clear idea of why it’s appropriate to do in this particular case.

1 Like

@elmstedt
I follow. Can you recommend some techniques to reduce my dimensions? Or how to determine how many dimensions I should use for my clustering?

1 Like

First, let me ask what your goal is with clustering?

We might be able to come up with a different way to meet the same goal.

I don’t think any of the standard dimension reduction techniques will be terribly useful, but you’re welcome to try.
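
If you do want to try, it only takes a few lines with scikit-learn. PCA is shown here, and the component count is arbitrary, something you’d have to justify for your own data:

from sklearn.decomposition import PCA

# matrix is assumed to be an (n_samples, 1536) numpy array of embeddings
# n_components is arbitrary and must not exceed the number of samples
reduced = PCA(n_components=50).fit_transform(matrix)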

If you really want to try clustering, you can just generate the cosine-distance matrix and perform hierarchical clustering.

You can either pre-select the number of clusters or identify a distance threshold which will determine the size and number of clusters.
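
A rough sketch of that approach with scipy, where matrix is your stacked embeddings and both cut values below are placeholders to tune, not recommendations:

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# matrix is an (n_samples, n_dims) array of embeddings
dists = pdist(matrix, metric="cosine")      # condensed cosine-distance matrix
Z = linkage(dists, method="average")        # hierarchical clustering on those distances

# either cut the tree at a distance threshold...
labels = fcluster(Z, t=0.5, criterion="distance")
# ...or ask for a fixed number of clusters
labels_k = fcluster(Z, t=8, criterion="maxclust")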

I’m trying to identify common topics in customer reviews. I’m creating an embedding for each review, and I would like to dynamically determine the number of topics/clusters based on what best fits the data, given a min and max number of topics/clusters. I’d also like to find a way to exclude obvious outliers dynamically.

1 Like

This OpenAI cookbook is exactly what you are looking for (it’s a bit outdated but the fundamentals should be the same). It clusters 1,000 food reviews and then attempts to use Davinci to capture the commonalities.

The dimensions are reduced using t-SNE (for visualization), and the clustering is centroid-based using K-Means.
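
Not the cookbook’s exact code, but the combination boils down to something like this (n_clusters is picked by hand, and matrix is the stacked embedding array):

from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# matrix is an (n_reviews, 1536) array of embeddings
n_clusters = 4
labels = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=42).fit_predict(matrix)

# t-SNE only squashes the embeddings to 2-D so the clusters can be plotted
coords = TSNE(n_components=2, random_state=42).fit_transform(matrix)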

1 Like

Hey, @cori !

My use case was the same as yours! But, unfortunately, I haven’t gone far: couldn’t make Fast Clustering work and didn’t have good results with BERTopic.

If you do find something useful, please, let me know!

@RonaldGRuckus, K-Means requires setting the number of clusters in advance, which doesn’t fit this use case.

Best

I’ve arrived at results I’m really quite happy with. Operating on the raw embeddings without further post-processing of any kind. Very comparable to what @cori needs the clustering for. I’ve based it on the OpenAI cookbook and essentially just swapped out the clustering algorithm.

I did run into the exact same problem of util.community_detection running indefinitely, for some embeddings. According to the comments, that one uses agglomerative clustering. So that’s what I switched over to in the cookbook, using scikit-learn.

Essentially just swapping out

from sklearn.cluster import KMeans

# from the cookbook: K-Means needs the number of clusters decided up front
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(matrix)
labels = kmeans.labels_

for

from sklearn.cluster import AgglomerativeClustering

# n_clusters=None plus a distance_threshold lets the data decide how many clusters come out
# (on scikit-learn 1.4+, pass metric="euclidean" instead of affinity="euclidean")
clustering = AgglomerativeClustering(n_clusters=None, affinity="euclidean", linkage="ward", distance_threshold=0.9)
clustering.fit(matrix)
labels = clustering.labels_

Where distance_threshold is the magic you’ll want to experiment with based on your data.
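
If it helps, a quick way to get a feel for it is to sweep a few thresholds and look at how many clusters come out (the range below is arbitrary):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# quick sweep to see how the cluster count reacts to the cut threshold
for t in np.arange(0.5, 1.6, 0.1):
    model = AgglomerativeClustering(n_clusters=None, linkage="ward", distance_threshold=t).fit(matrix)
    print(f"distance_threshold={t:.1f} -> {model.n_clusters_} clusters")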

I’ve experimented with various clustering algorithms and this one seems to work best, which is also what I’ve read in various other posts here.

4 Likes

Great first post. Welcome to the developer community!