Generally, you’re not going to get what you expect when clustering in very high dimensions.
Unless you have an absurdly huge (read: impossibly large) number of data points, clustering will be almost entirely meaningless.
With 1536 dimensions, if we break the space up into “quadrants” so to speak (orthants, strictly), based on the sign of each vector element, you’ve got 2^1536 bins each vector could land in. That’s about 2.41 × 10^462 directions the vector could point in. The vast majority of those will be empty.
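To sanity-check that figure, here is a small sketch that counts the sign-pattern bins with Python's exact integers. The variable names are just illustrative.

```python
import math

dims = 1536

# One bin per combination of signs across all vector elements.
orthants = 2 ** dims

# Python integers are exact, so the digit count gives the power of ten.
exponent = len(str(orthants)) - 1

# Recover the leading digits from the fractional part of log10(2^dims).
mantissa = 10 ** (dims * math.log10(2) - exponent)

print(f"2^{dims} ≈ {mantissa:.2f} × 10^{exponent}")
# prints: 2^1536 ≈ 2.41 × 10^462
```

For comparison, even a billion embeddings would touch at most 10^9 of those bins.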
Thinking in this many dimensions breaks the brain, and it becomes increasingly difficult to interpret what proximity even means there.
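One way to get a feel for this is to watch how the gap between the nearest and farthest neighbour shrinks as the dimension grows. The sketch below uses random Gaussian points, which is an assumption for illustration only (real embeddings are not random noise), and `relative_contrast` is a made-up helper name.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point."""
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.gauss(0, 1) for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 16, 128, 1536):
    print(dim, round(relative_contrast(dim), 3))
```

On data like this the contrast drops sharply with dimension, so "nearest" and "farthest" end up almost the same distance away. Embeddings with real structure should behave better than pure noise, which is why this is only a rough intuition pump.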
So, while it’s certainly possible (if computationally expensive) to cluster vectors in this many dimensions, I’m going to recommend against it unless you have a very clear idea of why it’s appropriate to do in this particular case.
I’m trying to identify common topics in customer reviews. I’m creating an embedding for each review, and I would like to dynamically determine the number of topics/clusters based on what best fits the data, given a min and max number of topics/clusters. I’d also like to find a way to exclude obvious outliers dynamically.
This OpenAI cookbook is exactly what you are looking for (it’s a bit outdated but the fundamentals should be the same). It clusters 1,000 food reviews and then attempts to use Davinci to capture the commonalities.
The dimensions are reduced using t-SNE (for visualization), and the clustering is centroid-based, using K-Means.
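For the "min and max number of clusters" part of the question, here is a minimal sketch of one common approach: run K-Means for each candidate k and keep the one with the best silhouette score. The silhouette criterion, the helper name `pick_k`, and the synthetic blobs standing in for review embeddings are all my assumptions, not something taken from the cookbook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(embeddings, k_min=2, k_max=10, seed=0):
    """Try every k in [k_min, k_max] and keep the best silhouette score."""
    X = np.asarray(embeddings, dtype=float)
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)  # in [-1, 1], higher is better
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# Synthetic stand-in for embeddings: three well-separated blobs in 32 dims.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 32)) * 10
demo = np.vstack([c + rng.normal(size=(30, 32)) for c in centers])

best_k, demo_labels = pick_k(demo, k_min=2, k_max=6)
print("chosen k:", best_k)
```

K-Means assigns every point to some cluster, so this does not handle outliers by itself. Silhouette scores on real 1536-dimensional embeddings also tend to be much flatter than on toy blobs, so treat the chosen k as a suggestion rather than an answer.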
I’ve arrived at results I’m really quite happy with, operating on the raw embeddings without further post-processing of any kind. My use case is very comparable to what @cori needs the clustering for. I’ve based it on the OpenAI cookbook and essentially just swapped out the clustering algorithm.
I did run into the exact same problem of util.community_detection running indefinitely for some embeddings. According to the comments, that one uses agglomerative clustering, so that’s what I switched over to in the cookbook, using scikit-learn.
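A rough sketch of what that kind of swap might look like with scikit-learn's AgglomerativeClustering is below. This is not the code from the post above: the threshold of 0.5, the minimum cluster size of 3, the helper name `cluster_with_outliers`, and the synthetic data are all assumptions for illustration. Setting a distance threshold instead of a cluster count lets the number of clusters fall out of the data, and very small clusters are relabelled as outliers (-1).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_with_outliers(embeddings, distance_threshold=0.5, min_cluster_size=3):
    """Agglomerative clustering on unit-length vectors; tiny clusters become -1."""
    X = np.asarray(embeddings, dtype=float)
    # Unit-normalise so Euclidean distance tracks cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    model = AgglomerativeClustering(
        n_clusters=None,                    # let the threshold decide the count
        distance_threshold=distance_threshold,
        linkage="average",
    )
    labels = model.fit_predict(X)
    sizes = np.bincount(labels)
    labels = labels.copy()
    labels[sizes[labels] < min_cluster_size] = -1  # mark small clusters as outliers
    return labels

# Synthetic stand-in for embeddings: three tight blobs plus two stray points.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 32)) * 10
blobs = np.vstack([c + rng.normal(size=(20, 32)) for c in centers])
strays = rng.normal(size=(2, 32))

labels = cluster_with_outliers(np.vstack([blobs, strays]))
n_clusters = len(set(labels.tolist()) - {-1})
print("clusters:", n_clusters, "outliers:", int((labels == -1).sum()))
```

The threshold is the knob to tune here. On real embeddings a sensible value depends on the model and the data, so it is worth trying a few values and reading samples from each cluster.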