How do I cluster/segment my text after the embedding process for easier understanding?

I have a dataset with text similar to the one in the example URL. I can create embeddings and add them as a new column in my dataset, just like the ‘search_docs’ function in the example. So far, everything is working well based on this example.

However, I’m currently facing a challenge because I don’t have a big-picture understanding of the text data or know what kind of questions to ask. My objective is to collect all the text from my records into one large text and then identify and classify 15-20 major topics. I want to assign these major topics individually to each row so that all my data falls into 15-20 distinct groups or “buckets”. This task is essentially similar to using K-Means or other clustering techniques.

I would greatly appreciate any assistance or guidance in achieving this goal. Please note that I am using Azure OpenAI and Python for this project.

The URL I mentioned above:

Hi and welcome to the Developer Forum!

Just out of interest, is this a Google Foobar challenge? It reminds me of one. This is a fairly involved area of study in information theory and semantics, particularly latent semantic analysis. What have you got code-wise so far?


You could just use cosine similarity to cluster them.

Just pick a small radius, like 0.01 or something, depending on the embedding model and the number of embeddings you want to reduce.

Then pick a random embedding to start with and define an abstract starting label “0”. Find all unlabeled embeddings within 0.01 of “0” and label these “0” as well.

Then pick a random unlabeled vector, find all vectors within 0.01 of it, give them the abstract label “1”, and mark all of these as done as well.

Keep doing this until all embeddings are abstractly labeled.

Then, when done, you have clustered embeddings.

Hopefully the algorithm makes sense. It’s a repeated loop, but it gets faster with each iteration, since you skip computing the similarity for already-labeled vectors.

As for the 0.01, this is a hyperparameter. Smaller means your clusters are more distinct, as you will have more clusters. Higher, and you get fewer clusters and less distinction.

PS. Oh, and because you want the categories to be “centered” in the space, you do this whole categorization multiple times, and the final label is the median or average category value. So the random seed should not be set the same for each run; it really needs to be random.
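For reference, a minimal sketch of that thresholded clustering, assuming the embeddings are already in a NumPy array; the 0.01 radius and the function name are just illustrative:

```python
import numpy as np

def threshold_cluster(embeddings, radius=0.01, seed=None):
    """Greedy threshold clustering: repeatedly pick a random unlabeled vector
    and give every unlabeled vector within `radius` (cosine distance) the same label."""
    rng = np.random.default_rng(seed)
    # Normalize once so cosine distance is simply 1 - dot product.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = np.full(len(unit), -1)  # -1 means "not labeled yet"
    current = 0
    while (unlabeled := np.flatnonzero(labels == -1)).size > 0:
        seed_idx = rng.choice(unlabeled)               # random starting vector
        dist = 1.0 - unit[unlabeled] @ unit[seed_idx]  # cosine distance to the seed
        labels[unlabeled[dist <= radius]] = current    # everything within the radius joins the cluster
        current += 1
    return labels
```

To “center” the labels as described above, you could run this several times with different seeds and keep a consensus (median/average) label per row.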

Latent Dirichlet Allocation, maybe?

Looks like a fancy TF-IDF. I had fun reading about it.

But why wouldn’t you just use TF-IDF? Or a combination of these keyword algorithms and embeddings for topic modeling?

I wish there were some comparison of all these topic modeling techniques.
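For anyone who wants to try the LDA route, here is a minimal sketch with scikit-learn; the `texts` list, topic count, and vectorizer settings are placeholders rather than a tuned setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical placeholder: use your dataset's text column here.
texts = [
    "refund request for damaged package",
    "how do I reset my account password",
    "shipping delay on international order",
]

# LDA works on word counts, not embeddings.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(texts)

# ~20 topics, matching the 15-20 buckets mentioned in the question.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(counts)   # rows: documents, cols: topic weights
assigned = doc_topics.argmax(axis=1)     # hard-assign each row to its strongest topic

# Peek at the top words per topic to hand-label the buckets.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-8:][::-1]]
    print(k, top)
```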


I see papers in the making!


I’m sure you want an end product (not completely described) instead of an idea or visualization, but enjoy this:


Without getting too crazy and going into “science project” land, you could just use multiple similarity algorithms, both embedding-based and keyword-based, and use RRF (reciprocal rank fusion) to combine the results.

That’s probably what I would do. Just pick your top N algorithms and fuse the results. Then do multiple passes with different random draws and take the average/median category to “center” each category.

Also, with RRF, you could weight each algorithm’s stream differently. And if you have a priori knowledge that algorithm performance is a function of the content, you can apply that dynamically to your RRF weighting in real time, as a function of the exact chunk being categorized. [A simple example: if the chunk is short, increase the weight of the embedding algorithms relative to the keyword algorithms; if it contains lots of numbers, increase the weight of the keyword algorithms.]

It could get crazy at this higher algorithmic level, but theoretically this is a no-frills statistical technique for getting the highest level of topic-modeling performance.

With just a simple RRF ranking, like 1, 2, 3, 4, etc., you may have to run this even more times to center things; maybe just pick the top 1 or 2. You could speed up convergence by looking at a numerical measure like cosine similarity or “mutual information” in something like my MIX algorithm, which is like a log-normalized TF-IDF algorithm.
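To make the RRF part concrete, here is a minimal sketch of weighted reciprocal rank fusion over two hypothetical rankers (one embedding-based, one keyword-based); the constant k=60 is a common default and the weights are purely illustrative:

```python
from collections import defaultdict

def rrf_fuse(rankings, weights=None, k=60):
    """Weighted reciprocal rank fusion.
    `rankings` is a list of ranked lists of candidate topic ids, best first.
    Score for an id = sum over rankers of weight / (k + rank)."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranked, w in zip(rankings, weights):
        for rank, item in enumerate(ranked, start=1):
            scores[item] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: topic ids ranked by an embedding model and by TF-IDF.
embedding_rank = ["pricing", "support", "shipping"]
keyword_rank   = ["support", "pricing", "returns"]
# Short chunk, so weight the embedding ranker a bit higher, per the note above.
fused = rrf_fuse([embedding_rank, keyword_rank], weights=[1.5, 1.0])
print(fused)  # fused ranking; take the top 1 or 2 as the chunk's category
```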


If I may add my sometimes weird approach:

I had to come up with suggested categories for a website with more than 3k posts and analyse the site’s semantics.

My approach was:

  1. Reduce each document to a list of its 30 main subjects/entities, ordered by their importance in the text and their relation to the article’s subject.

  2. Build a list of all subjects/entities present across all documents, mapped to the number of times each appears.

  3. Group subjects/entities by parent/child relationships to establish a hierarchy of the items in the list.

  4. Calculate the “importance” of each item by adding a weighted count of its own appearances to the sum of appearances of all its descendants (all the way down): $importance = (3 * $own_appearances) + sum($child_appearances) (see the sketch below).

  5. Order the list by importance at each level…

You get the idea. I used gpt-3.5 to automate the semantic tasks.
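To make steps 2-4 concrete, here is a minimal sketch that assumes step 1’s subject lists and the parent/child links are already available (e.g., produced by a gpt-3.5 prompt); all names and data are illustrative:

```python
from collections import Counter

# Hypothetical inputs: step 1's per-document subject lists and LLM-supplied parent/child links.
doc_subjects = [
    ["machine learning", "neural networks", "python"],
    ["python", "pandas", "data cleaning"],
    ["neural networks", "backpropagation"],
]
children = {  # parent subject -> list of child subjects
    "machine learning": ["neural networks"],
    "neural networks": ["backpropagation"],
    "python": ["pandas"],
}

# Step 2: count appearances of every subject across all documents.
appearances = Counter(s for subjects in doc_subjects for s in subjects)

# Step 4: importance = 3 * own appearances + appearances of all descendants (recursively).
def descendant_appearances(subject):
    total = 0
    for child in children.get(subject, []):
        total += appearances.get(child, 0) + descendant_appearances(child)
    return total

importance = {s: 3 * appearances[s] + descendant_appearances(s) for s in appearances}

# Step 5: order by importance.
for subject, score in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(subject, score)
```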

Oh, almost forgot. Why take this approach instead of clustering vectors by similarity?

  1. The vector of a full text is more or less “bloated” depending on the size of the text and how “detailed” it is, while operating directly on “distilled” subjects/entities is more “precise” by definition.

  2. Working on vectors first still requires you to extract a subject from each vector group in order to translate the groups into usable “topics”. Operating directly on subjects, besides point 1 above, also skips the group-labeling step and lets you have child topics at the same time.


Interesting, how did you do the document reduction in step 1?

Prompt gpt-3.5 to extract subjects, entities, events, etc. as a list, then embed the whole text and each item from the list, sort the list by cosine similarity to the whole-text vector, and keep the top 30.

When the text was over the token limit, I used recursive summarization (to 50% of the original length for each chunk).
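For reference, a minimal sketch of that ranking step using the Azure OpenAI Python SDK (openai >= 1.x); the endpoint, key, deployment name, and `subjects` list are placeholders, not the actual setup:

```python
import numpy as np
from openai import AzureOpenAI

# Placeholders: fill in your own endpoint, key, and deployment names.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",
    api_version="2024-02-01",
)

def embed(texts, deployment="text-embedding-ada-002"):
    # With Azure OpenAI, `model` is the name of your embedding deployment.
    resp = client.embeddings.create(model=deployment, input=texts)
    return np.array([d.embedding for d in resp.data])

def top_subjects(full_text, subjects, n=30):
    """Rank extracted subjects by cosine similarity to the whole-text embedding."""
    vectors = embed([full_text] + subjects)
    doc_vec, subj_vecs = vectors[0], vectors[1:]
    sims = subj_vecs @ doc_vec / (
        np.linalg.norm(subj_vecs, axis=1) * np.linalg.norm(doc_vec)
    )
    order = np.argsort(sims)[::-1][:n]
    return [subjects[i] for i in order]

# `subjects` would come from a gpt-3.5 extraction prompt, as described above.
```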
