Let's say I have a target text and two candidate texts:
T: "I love cats"
C1: "I love dogs"
C2: "I love ice-cream"
I want to create embeddings for those, then rank the candidates according to cosine similarity.
Now, is there also a way to find the most relevant words in the embeddings for the similarity? The most important words?
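As a minimal sketch of the embed-and-rank step, here is a toy bag-of-words "embedding" with cosine similarity. The `embed` and `cosine` helpers are illustrative only; in a real setup you would swap in an actual sentence-embedding model and use its vectors instead:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding": word -> count.
    # Stand-in for a real sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

target = "I love cats"
candidates = ["I love dogs", "I love ice-cream"]

t = embed(target)
# Rank candidates by similarity to the target, highest first.
ranked = sorted(candidates, key=lambda c: cosine(t, embed(c)), reverse=True)
```

Note that with this toy embedding both candidates tie (they share exactly the words "I love" with the target), which is precisely why a real embedding model, which knows that cats are closer to dogs than to ice-cream, is needed for meaningful ranking.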
You could use something like NLTK to remove stop words like "I", "the", etc., which carry little meaning.
After that, it becomes a combinatorial experiment: add and remove words to measure the model's sensitivity to each word or group of words.
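The simplest version of that experiment is leave-one-out ablation: drop each word in turn and measure how much the similarity to the target moves. A sketch, again using a toy bag-of-words embedding as a stand-in for a real model (`word_importance` is an invented helper name):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def word_importance(target, candidate):
    # Remove each word of the candidate in turn and record how much the
    # similarity to the target drops: a bigger drop means a more important
    # word; a negative drop means the word was hurting the similarity.
    base = cosine(embed(target), embed(candidate))
    words = candidate.split()
    scores = {}
    for i, w in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores[w] = base - cosine(embed(target), embed(ablated))
    return scores

scores = word_importance("I love cats", "I love dogs")
```

Here the score for "dogs" comes out negative (removing it leaves "I love", which matches the target better), which is exactly the kind of per-word signal the question is after. For full combinations rather than single words you would iterate over word subsets, which is where the combinatorial cost comes from.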
Maybe more straightforward, with less combinatorial craziness, is to form your own set of keywords. You basically take chunks of text and histogram each word in the chunk. From this you derive "rare words", which carry more information. Then you use this "rarity index" as a filter, essentially sucking out all the important, rare words in time order, and send those to the embedding engine to capture the essence of the meaning.
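A sketch of that rarity filter, assuming a simple count threshold as the "rarity index" (`rare_keywords` and the `max_count` cutoff are invented for illustration; a TF-IDF weighting over a larger corpus would be the more standard version of the same idea):

```python
from collections import Counter

def rare_keywords(chunks, max_count=1):
    # Histogram every word across all chunks, then keep only words whose
    # total count is at or below max_count (an assumed tunable cutoff).
    counts = Counter(w for chunk in chunks for w in chunk.lower().split())
    rare = {w for w, c in counts.items() if c <= max_count}
    # Filter each chunk down to its rare words, preserving time order,
    # ready to be sent to the embedding engine.
    return [[w for w in chunk.lower().split() if w in rare] for chunk in chunks]

chunks = ["I love cats", "I love dogs", "I love ice-cream"]
filtered = rare_keywords(chunks)
```

On the three example texts this strips the common "I love" and keeps only "cats", "dogs", and "ice-cream", the words that actually distinguish the candidates.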