How do I cluster/segment my text after the embedding process for easier understanding?

I have a dataset with text similar to the one in the example URL. I can create embeddings and add them as a new column in my dataset, just like the ‘search_docs’ function in the example. So far, everything is working well based on this example.

However, I’m currently facing a challenge because I don’t have a big-picture understanding of the text data or know what kind of questions to ask. My objective is to collect all the text from my records into one large text and then identify and classify 15-20 major topics. I want to assign these major topics individually to each row so that all my data falls into 15-20 distinct groups or “buckets”. This task is essentially similar to using K-Means or other clustering techniques.

I would greatly appreciate any assistance or guidance in achieving this goal. Please note that I am using Azure OpenAI and Python for this project.

The URL I mentioned above:

Hi and welcome to the Developer Forum!

Just out of interest, is this a Google Foobar challenge? It reminds me of one. This is a fairly involved area of study in information theory and semantics, particularly latent semantic analysis. What have you got code-wise so far?


You could just use cosine similarity to cluster them.

Just pick a small radius, like 0.01 or something, depending on the embedding model and the number of embeddings you want to reduce.

Then pick a random embedding to start with and define an abstract starting label “0”. Find all unlabeled embeddings within 0.01 of “0” and label these “0” as well.

Then pick a random unlabeled vector, find all vectors within 0.01 of it, give them the abstract label “1”, and mark all of these as done as well.

Keep doing this until all embeddings are abstractly labeled.

Then, when done, you have clustered embeddings.

Hopefully the algorithm makes sense. It’s a repeated loop, but it gets faster with each iteration, since you skip computing the similarity for already-labeled vectors.

As for the 0.01, this is a hyperparameter. Smaller means your clusters are more distinct, as you will have more clusters. Higher, and you get fewer clusters and less distinction.

PS. Oh, and because you want the categories to be “centered” in the space, you do this whole categorization multiple times, and the final label is the median or average category value. So the random seed should not be set the same for each run; it really needs to be random.
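For reference, a minimal sketch of that thresholded clustering, assuming the embeddings are already in a NumPy array; the 0.01 radius and the function name are just illustrative:

```python
import numpy as np

def threshold_cluster(embeddings, radius=0.01, seed=None):
    """Greedy threshold clustering: repeatedly pick a random unlabeled vector
    and give every unlabeled vector within `radius` (cosine distance) the same label."""
    rng = np.random.default_rng(seed)
    # Normalize once so cosine distance is simply 1 - dot product.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = np.full(len(unit), -1)  # -1 means "not labeled yet"
    current = 0
    while (unlabeled := np.flatnonzero(labels == -1)).size > 0:
        seed_idx = rng.choice(unlabeled)               # random starting vector
        dist = 1.0 - unit[unlabeled] @ unit[seed_idx]  # cosine distance to the seed
        labels[unlabeled[dist <= radius]] = current    # everything within the radius joins the cluster
        current += 1
    return labels
```

To “center” the labels as described above, you could run this several times with different seeds and keep a consensus (median/average) label per row.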

Latent Dirichlet Allocation, maybe?

Looks like a fancy TF-IDF. I had fun reading about it.

But why wouldn’t you just use TF-IDF? Or a combination of these keyword algorithms and embeddings for topic modeling?

I wish there were some comparison of all these topic modeling techniques.
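For anyone who wants to try the LDA route, here is a minimal sketch with scikit-learn; the `texts` list, topic count, and vectorizer settings are placeholders rather than a tuned setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical placeholder: use your dataset's text column here.
texts = [
    "refund request for damaged package",
    "how do I reset my account password",
    "shipping delay on international order",
]

# LDA works on word counts, not embeddings.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(texts)

# ~20 topics, matching the 15-20 buckets mentioned in the question.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(counts)   # rows: documents, cols: topic weights
assigned = doc_topics.argmax(axis=1)     # hard-assign each row to its strongest topic

# Peek at the top words per topic to hand-label the buckets.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-8:][::-1]]
    print(k, top)
```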


I see papers in the making!


I’m sure you want an end product (not completely described) instead of an idea or visualization, but enjoy this:


Without getting too crazy and going into “science project” land, you could just use multiple similarity algorithms, both embedding-based and keyword-based, and use RRF (reciprocal rank fusion) to combine the results.

That’s probably what I would do. Just pick your top N algorithms and fuse the results. Then do multiple passes with different random draws and take the average/median category to “center” each category.

Also, with RRF, you could weight each algorithm’s stream differently. And if you have a priori knowledge that algorithm performance is a function of the content, you can apply that dynamically to your RRF weighting in real time, as a function of the exact chunk being categorized. [A simple example: if the chunk is short, increase the weight of the embedding algorithms relative to the keyword algorithms; if it contains lots of numbers, increase the weight of the keyword algorithms.]

It could get crazy at this higher algorithmic level, but theoretically this is a no-frills statistical technique for getting the highest level of topic-modeling performance.

With just a simple RRF ranking, like 1, 2, 3, 4, etc., you may have to run this even more times to center things; maybe just pick the top 1 or 2. You could speed up convergence by looking at a numerical measure like cosine similarity or “mutual information” in something like my MIX algorithm, which is like a log-normalized TF-IDF algorithm.
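To make the RRF part concrete, here is a minimal sketch of weighted reciprocal rank fusion over two hypothetical rankers (one embedding-based, one keyword-based); the constant k=60 is a common default and the weights are purely illustrative:

```python
from collections import defaultdict

def rrf_fuse(rankings, weights=None, k=60):
    """Weighted reciprocal rank fusion.
    `rankings` is a list of ranked lists of candidate topic ids, best first.
    Score for an id = sum over rankers of weight / (k + rank)."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranked, w in zip(rankings, weights):
        for rank, item in enumerate(ranked, start=1):
            scores[item] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: topic ids ranked by an embedding model and by TF-IDF.
embedding_rank = ["pricing", "support", "shipping"]
keyword_rank   = ["support", "pricing", "returns"]
# Short chunk, so weight the embedding ranker a bit higher, per the note above.
fused = rrf_fuse([embedding_rank, keyword_rank], weights=[1.5, 1.0])
print(fused)  # fused ranking; take the top 1 or 2 as the chunk's category
```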


If I may add my sometimes weird approach:

I had to come up with suggested categories for a website with more than 3k posts and analyse the site’s semantics.

My approach was:

  1. Reduce each document to a list of its 30 main subjects/entities, ordered by their importance in the text and their relation to the article’s subject.

  2. Build a list of all subjects/entities present across all documents, mapped to the number of times each appears.

  3. Group subjects/entities by parent/child relationships to establish a hierarchy of the items in the list.

  4. Calculate the “importance” of each item by adding a weighted count of its own appearances to the sum of appearances of all its descendants (all the way down): $importance = (3 * $own_appearances) + sum($child_appearances) (see the sketch below).

  5. Order the list by importance at each level…

You get the idea. I used gpt-3.5 to automate the semantic tasks.
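To make steps 2-4 concrete, here is a minimal sketch that assumes step 1’s subject lists and the parent/child links are already available (e.g., produced by a gpt-3.5 prompt); all names and data are illustrative:

```python
from collections import Counter

# Hypothetical inputs: step 1's per-document subject lists and LLM-supplied parent/child links.
doc_subjects = [
    ["machine learning", "neural networks", "python"],
    ["python", "pandas", "data cleaning"],
    ["neural networks", "backpropagation"],
]
children = {  # parent subject -> list of child subjects
    "machine learning": ["neural networks"],
    "neural networks": ["backpropagation"],
    "python": ["pandas"],
}

# Step 2: count appearances of every subject across all documents.
appearances = Counter(s for subjects in doc_subjects for s in subjects)

# Step 4: importance = 3 * own appearances + appearances of all descendants (recursively).
def descendant_appearances(subject):
    total = 0
    for child in children.get(subject, []):
        total += appearances.get(child, 0) + descendant_appearances(child)
    return total

importance = {s: 3 * appearances[s] + descendant_appearances(s) for s in appearances}

# Step 5: order by importance.
for subject, score in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(subject, score)
```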

Oh, almost forgot. Why take this approach instead of clustering vectors by similarity?

  1. The vector of a full text is more or less “bloated” depending on the size of the text and how “detailed” it is, while operating directly on “distilled” subjects/entities is more “precise” by definition.

  2. Working on vectors first still requires you to extract a subject from each vector group in order to translate the groups into usable “topics”. Operating directly on subjects, besides point 1 above, also skips the group-labeling step and lets you have child topics at the same time.


Interesting, how did you do the document reduction in step 1?

Prompt gpt-3.5 to extract subjects, entities, events, etc. as a list, then embed the whole text and each item from the list, sort the list by cosine similarity to the whole-text vector, and keep the top 30.

When the text was over the token limit, I used recursive summarization (to 50% of the original length for each chunk).
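For reference, a minimal sketch of that ranking step using the Azure OpenAI Python SDK (openai >= 1.x); the endpoint, key, deployment name, and `subjects` list are placeholders, not the actual setup:

```python
import numpy as np
from openai import AzureOpenAI

# Placeholders: fill in your own endpoint, key, and deployment names.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",
    api_version="2024-02-01",
)

def embed(texts, deployment="text-embedding-ada-002"):
    # With Azure OpenAI, `model` is the name of your embedding deployment.
    resp = client.embeddings.create(model=deployment, input=texts)
    return np.array([d.embedding for d in resp.data])

def top_subjects(full_text, subjects, n=30):
    """Rank extracted subjects by cosine similarity to the whole-text embedding."""
    vectors = embed([full_text] + subjects)
    doc_vec, subj_vecs = vectors[0], vectors[1:]
    sims = subj_vecs @ doc_vec / (
        np.linalg.norm(subj_vecs, axis=1) * np.linalg.norm(doc_vec)
    )
    order = np.argsort(sims)[::-1][:n]
    return [subjects[i] for i in order]

# `subjects` would come from a gpt-3.5 extraction prompt, as described above.
```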
