Clustering phrases into categories

cirrus.shakeri · October 12, 2021, 4:54am

Any suggestions for how the API can be used to cluster a given set of phrases? There could be tens to thousands of phrases. The goal is to group closely related phrases into a set of classes. The classes are not known a priori. Thanks in advance for any hints.

dplutcho · April 13, 2022, 4:09pm

Did you figure this out? I tried but no luck so far.

lenwhite6094 · April 13, 2022, 5:27pm

I’m working on the same thing. Embeddings seem to be the way to go.
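
Here is a minimal sketch of what I mean, assuming you already have one embedding vector per phrase. The vectors below are made-up toy values standing in for real API embeddings, and `cluster_by_threshold` and the 0.8 cutoff are my own hypothetical choices, not anything from the API. Since the classes are not known up front, it groups phrases whose cosine similarity clears a threshold instead of asking for a fixed number of clusters:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cluster_by_threshold(vectors, threshold=0.8):
    """Single-link grouping: two items end up in the same cluster if a chain
    of pairs with cosine similarity >= threshold connects them."""
    n = len(vectors)
    parent = list(range(n))  # union-find forest, one root per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare every pair once; fine for thousands of phrases, slow beyond that.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Toy data: hypothetical 3-d vectors, NOT real embeddings.
phrases = ["reset my password", "forgot my login",
           "refund my order", "money back please"]
toy_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],
                  [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]

clusters = cluster_by_threshold(toy_embeddings, threshold=0.8)
for members in clusters:
    print([phrases[i] for i in members])
```

The threshold is the knob to tune: a higher value gives more, tighter groups, and a lower value merges things. Single-link grouping can also chain loosely related phrases together, so a library clustering routine may be a better fit for real data.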

lmccallum · April 13, 2022, 7:32pm

I found this helpful: openai-python/Visualize_in_3d.ipynb at main · openai/openai-python · GitHub

DutytoDevelop · April 18, 2022, 12:00am

I am very much interested in using the embeddings feature to study linguistics further and get a better-rooted understanding of how everything connects (including the human-made languages we’ve built, which encapsulate meaning so that we can communicate with a larger audience that has agreed to use the same language).

How would you approach this task of clustering phrases into categories? Are you hoping to find root insight into the origins of communication, trying to derive insight from time-sensitive contextual data (e.g. clustering phrases into categories in hopes of seeing a new social trend form before it goes mainstream), or something else entirely?

If you think about it, each individual word in a phrase could technically be considered a “category,” since each word is composed of a unique subset of other words that hold their own meaning apart from all other words. A good metaphor is prime factorization: prime numbers can’t be broken down into smaller factors, but larger numbers are composed of prime factors. For example, the word “bear” could be broken down into “a hairy mammal,” but other animals fall into that category too, so you must capture all the unique attributes that make a bear, well, a bear. At the same time, you have to avoid the common pitfall of defining a word based on a preconceived notion of it (not all bears are black, brown, or white, for example).

Also, are you going to collect your dataset of phrases such that every phrase is unique in structure as well? To explain what I mean, here are two phrases that are structurally different but contextually the same: “The cat, and the dog, walked together” and “the dog, and the cat, walked together.” Of course, it comes down to choice here. I’d just like to churn up thoughts and ideas alongside others who share that perspective and collaborate on the overall objective, the various topics, the approaches to them, etc.
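
To make the cat/dog example concrete, here is a rough sketch of one way to flag phrases that use the same words in a different order before clustering. `word_bag` is a hypothetical helper of my own, and comparing bags of lowercased words is only one possible notion of "structurally different but the same":

```python
import re
from collections import Counter

def word_bag(phrase):
    """Lowercase the phrase, drop punctuation, and count each word,
    so that word order and commas no longer matter."""
    return Counter(re.findall(r"[a-z0-9']+", phrase.lower()))

a = "The cat, and the dog, walked together"
b = "the dog, and the cat, walked together."

# True when both phrases contain exactly the same words the same number of times.
same_words = word_bag(a) == word_bag(b)
print(same_words)
```

This ignores meaning entirely, so “the dog bit the man” and “the man bit the dog” would also match. Whether that is acceptable depends on what you want the clusters to capture.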

Sorry if my feedback is too granular or further away from the feedback you were hoping to receive. I like to take a big-picture approach to things sometimes and this sparked one of those moments in me!
