Group together suggestions and reduce them

I have a large set of suggestions/ideas, around 250,000. Many of them are similar or almost exactly the same.

I want to reduce the number of suggestions and categorize them.

I know I have to use embeddings for this, but how do I best reduce and categorize them?

I understand you have a collection of some 250k “suggestions/ideas” which you want to condense down into fewer examples.

My first question is, “why?” What are you planning to do with them? Fine-tuning? Embedding? Or something outside of a GPT model altogether?

I ask because condensing them may not be advantageous (strictly from a results-quality standpoint): the model may benefit from the subtle nuances that differentiate them.

But if you have valid reasons for wanting to condense them, one approach you might take would be to:

  1. Create the 250,000 vector embeddings
  2. Run a cluster analysis on your embeddings, e.g. k-means clustering or something similar.
  3. For each cluster, inject all the members of the cluster into context and ask the model to summarize them into one clear and concise example.
  4. Check by embedding the summary and verifying it is at or near the center of mass of the cluster.

Alternatively, you could skip steps 3 & 4 and just pick the most representative example of each cluster (the member closest to its central point).
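A minimal sketch of steps 2 and 4 plus the "pick the central member" shortcut, assuming the embeddings are already available as a NumPy array and that scikit-learn is installed. The function names, the number of clusters, and the `min_cosine` threshold are illustrative assumptions, not recommended values.

```python
import numpy as np
from sklearn.cluster import KMeans


def normalize(vectors):
    """Scale each row to unit length so Euclidean distance tracks cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def cluster_and_pick_representatives(embeddings, n_clusters, seed=0):
    """Run k-means and return (labels, representative indices, centroids).

    The representative of a cluster is the member closest to its centroid.
    n_clusters is a guess you have to tune; it is not derived from the data.
    """
    unit = normalize(np.asarray(embeddings, dtype=float))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(unit)
    representatives = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(unit[members] - km.cluster_centers_[c], axis=1)
        representatives.append(int(members[np.argmin(dists)]))
    return km.labels_, representatives, km.cluster_centers_


def summary_is_central(summary_embedding, centroid, min_cosine=0.9):
    """Step 4 check: is the summary's embedding close to the cluster centroid?

    min_cosine is an arbitrary illustrative threshold.
    """
    a = np.asarray(summary_embedding, dtype=float)
    b = np.asarray(centroid, dtype=float)
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return bool(cosine >= min_cosine)
```

With 250,000 vectors, scikit-learn's `MiniBatchKMeans` can be swapped in for `KMeans` if the full fit is too slow. Either way the number of clusters has to be chosen by trial and error.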

With that said, I cannot promise this will come even close to working. Embeddings are length-1536 vectors. The curse of dimensionality bites hard at that many dimensions, so traditional clustering methods may be unsuited to your needs. Still, it’s worth a try if you’ve got the API allowance to burn.

I have two scenarios where I want to use this.

The first one:
Development suggestions that we get from customers. Many of them say the same thing in different words, and we want a smaller list with the duplicates removed.

The other one:
Users can add sentences for different responses to users. We also want to go through these, see how many are the same and how many times each is used, and populate a condensed list.
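For the second scenario, once every sentence has a cluster label from whatever clustering step is used, the condensed list with usage counts could be built roughly like this. The function name, the hand-assigned labels, and the `representatives` mapping (cluster label to index of the sentence chosen to stand for it) are all hypothetical and only illustrate the shape of the data.

```python
from collections import Counter


def condensed_list(texts, labels, representatives):
    """Return (representative text, member count) pairs, most frequent first."""
    counts = Counter(labels)
    rows = [(texts[representatives[label]], n) for label, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy data: labels are assigned by hand here, not produced by a real clustering run.
texts = [
    "Add dark mode",
    "Please add a dark theme",
    "Dark mode please",
    "Export to CSV",
    "CSV export would be nice",
]
labels = [0, 0, 0, 1, 1]
representatives = {0: 0, 1: 3}

result = condensed_list(texts, labels, representatives)
print(result)  # [('Add dark mode', 3), ('Export to CSV', 2)]
```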