Clustering phrases into categories

Any suggestions for how the API can be used to cluster a given set of phrases? There could be tens to thousands of phrases. The goal is to group closely related phrases into a set of classes. The classes are not known a priori. Thanks in advance for any hints.

Did you figure this out? I tried but no luck so far.

I’m working on the same thing. Embeddings seem to be the way to go.
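
Here is a minimal sketch of how that could look when the number of classes isn’t known up front. It is not an official recipe: `toy_embed` is a hypothetical stand-in (plain word counts) so the example runs without network access, and in practice you would pass in a function that returns vectors from whichever embeddings endpoint you use. The greedy threshold clustering and the `threshold=0.5` default are assumptions of mine, chosen only for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(phrase):
    """Hypothetical stand-in for a real embedding call.

    Returns a sparse bag-of-words count vector. Replace with a function
    that returns a sparse dict of weights from an actual embedding model.
    """
    return Counter(re.findall(r"[a-z0-9']+", phrase.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-likes."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_phrases(phrases, threshold=0.5, embed=toy_embed):
    """Greedy single-pass clustering; the cluster count is not fixed in advance.

    Each phrase joins the most similar existing cluster if the similarity
    to that cluster's centroid reaches `threshold`; otherwise it starts a
    new cluster.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [str, ...]}
    for phrase in phrases:
        vec = embed(phrase)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [phrase]})
        else:
            best["members"].append(phrase)
            # Keep a running sum as the centroid; cosine ignores its scale.
            best["centroid"].update(vec)
    return [cluster["members"] for cluster in clusters]


if __name__ == "__main__":
    demo = [
        "reset my password",
        "password reset help",
        "track my order",
        "where is my order",
        "cancel subscription",
    ]
    for group in cluster_phrases(demo):
        print(group)
```

The result depends on the order of the input and on the threshold, so treat it as a starting point. With real dense embeddings you would likely want a library clustering routine instead of this loop.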

I found this helpful: the Visualize_in_3d.ipynb notebook in the openai/openai-python repository on GitHub (main branch).

I’m very interested in using the embeddings feature to study linguistics and get a better-grounded understanding of how everything connects. That includes the human-made languages we’ve built, which encapsulate meaning so that we can communicate with a larger audience that has agreed to use the same language.

How would you approach this task of clustering phrases into categories? Are you hoping to find fundamental insight into the origins of communication, trying to derive insight from time-sensitive contextual data (e.g. clustering phrases into categories in the hope of spotting a new social trend before it goes mainstream), or something else entirely?

If you think about it, each individual word in a phrase could itself be considered a “category”: each word is composed of a unique subset of other words that hold their own meanings. A good metaphor is prime factorization: prime numbers can’t be broken down into smaller factors, while larger numbers are composed of prime factors. For example, the word “bear” could be broken down into “a hairy mammal”, but other animals fall into that category too. So you must capture all the unique attributes of what makes a bear, well, a bear, while avoiding the common pitfall of defining the word from a preconceived notion of what a bear is (not all bears are black, brown, or white, for example).

Also, are you going to collect your dataset of phrases such that every phrase is unique in structure as well? Here is an example of what I mean, two phrases that are structurally different but mean the same thing: “The cat, and the dog, walked together” and “the dog, and the cat, walked together.” Of course, it comes down to choice here. I’d just like to stir up thoughts and ideas alongside others who share that perspective, and collaborate on the overall objective, the possible approaches, and so on.
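
To make the cat-and-dog example concrete, here is a small sketch of one way to spot phrases that differ only in word order before clustering. The helper names `order_insensitive_key` and `group_reorderings` are hypothetical, and the approach (lowercase, drop punctuation, sort the words) is just an assumption for illustration. It is deliberately crude: it would also treat “dog bites man” and “man bites dog” as the same phrase, so whether to use it really is a matter of choice.

```python
import re


def order_insensitive_key(phrase):
    """Return a key that ignores case, punctuation, and word order."""
    tokens = re.findall(r"[a-z0-9']+", phrase.lower())
    return tuple(sorted(tokens))


def group_reorderings(phrases):
    """Group phrases that contain exactly the same words in any order."""
    groups = {}
    for phrase in phrases:
        groups.setdefault(order_insensitive_key(phrase), []).append(phrase)
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return list(groups.values())


if __name__ == "__main__":
    phrases = [
        "The cat, and the dog, walked together",
        "the dog, and the cat, walked together.",
        "The cat walked alone",
    ]
    for group in group_reorderings(phrases):
        print(group)
```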

Sorry if my feedback is too granular or further away from the feedback you were hoping to receive. I like to take a big-picture approach to things sometimes and this sparked one of those moments in me!
