Developing a solution to identify context overlap in documents

Hi, I am trying to write a program that uses UMAP to identify which chunks have the most contextual overlap. But a single article by itself has 100-200 chunks, so when I run UMAP at the article level, the visualization becomes very cluttered. Are there any better ways to identify contextual overlap across a large number of articles?


It’s been a while since I was involved with such things. I guess you have already tried reducing the dimensions.
Is the problem with the visualisation itself, or with using UMAP to find the contextual overlap?

Don’t do it visually. You can’t do this with just 2 or 3 dimensions. Use cosine similarity on the embedding vectors. Embeddings typically have hundreds or thousands of dimensions, so they are impossible to visualise properly. Experiment and decide on a sensible threshold.
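A minimal sketch of what that could look like, assuming the chunk embeddings are already available as rows of a NumPy array. The function names, the toy 3-dimensional vectors and the 0.8 threshold are made up for illustration; real embeddings will be much larger and the threshold needs tuning on your own data.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def overlapping_pairs(embeddings: np.ndarray, threshold: float = 0.8):
    """Return (i, j, similarity) for chunk pairs at or above `threshold`.

    The threshold default is an assumption, not a recommended value.
    """
    sims = cosine_similarity_matrix(embeddings)
    # Upper triangle only: skips self-pairs and mirrored duplicates.
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)
    keep = sims[i_idx, j_idx] >= threshold
    pairs = [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(i_idx[keep], j_idx[keep])
    ]
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy stand-in for real chunk embeddings (one row per chunk).
chunk_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

pairs = overlapping_pairs(chunk_embeddings, threshold=0.8)
for i, j, sim in pairs:
    print(f"chunks {i} and {j}: similarity {sim:.3f}")
```

With 100-200 chunks per article the full similarity matrix is small enough to compute in memory, so no dimensionality reduction is needed for the comparison itself.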


Indeed. If you have a really complex case, you might consider storing the documents in a vector store or graph database (such as MongoDB Atlas or Neo4j) and letting it do the dirty work for you.

Thanks, I am following this approach now. For each section of the article I create a summary, then use a refine chain to combine those into a final summary and create one embedding at the article level. Then I compute cosine similarity between the article embeddings to find the similarities.
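For the last step, something along these lines might work once there is one embedding per article. This is only a sketch: the summarisation and embedding steps are left out, and `most_similar_article`, the article ids and the 2-dimensional vectors are hypothetical placeholders.

```python
import numpy as np


def most_similar_article(article_embeddings):
    """Map each article id to (closest other article id, cosine similarity).

    `article_embeddings` is a dict of id -> 1-D vector and needs at
    least two entries.
    """
    ids = list(article_embeddings)
    matrix = np.stack([np.asarray(article_embeddings[a], dtype=float) for a in ids])
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Exclude each article's similarity with itself.
    np.fill_diagonal(sims, -np.inf)
    best = sims.argmax(axis=1)
    return {ids[i]: (ids[int(j)], float(sims[i, j])) for i, j in enumerate(best)}


# Placeholder vectors standing in for embeddings of the final summaries.
article_embeddings = {
    "article_a": [1.0, 0.0],
    "article_b": [0.8, 0.6],
    "article_c": [0.0, 1.0],
}

nearest = most_similar_article(article_embeddings)
for article, (other, sim) in nearest.items():
    print(f"{article} is closest to {other} (similarity {sim:.2f})")
```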
