Extracting each word's embeddings from embedded sentence

Welcome to the community!

What you have is an embedding vector: it encodes the semantic meaning of your entire text, as a whole.

You can, in theory (in the sense that there’s nothing stopping you), send individual words to the endpoint to achieve your goal. Each word will also come back as a gigantic vector of the same dimensionality.

You then just compute the cosine similarity of each pair of vectors you want to compare.
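A minimal sketch of that comparison; the two arrays here are tiny stand-ins for the real embedding vectors an endpoint would return:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for embeddings returned by an embedding endpoint
# (real ones have hundreds or thousands of dimensions).
sentence_vec = np.array([0.2, 0.7, 0.1])
word_vec = np.array([0.3, 0.6, 0.2])

print(cosine_similarity(sentence_vec, word_vec))
```

The same function works pairwise across however many word vectors you want to rank against the sentence vector.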

Does that make sense?


I said “in theory” because, in practice, these text embedding models aren’t really built for comparing individual words: a single word can be highly context sensitive and mean different things depending on how and where it’s used. That said, using the embedding model this way may get you 99% of the way to your goal, and it may well be good enough. So I definitely encourage you to try it.

To improve it, if the budget and use case allow, you could consider using a generative LLM to translate your text into a structured list of contextual definitions, which you can then embed. Embeddings of definitions will generally capture more meaning than embeddings of isolated words.
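Roughly, that pipeline looks like the sketch below. `define` and `embed` are hypothetical stand-ins for the real LLM and embedding API calls; their toy bodies exist only to keep the example runnable:

```python
def define(word: str, context: str) -> str:
    # In practice: ask a generative LLM something like
    # "Define {word} as it is used in: {context}".
    return f"{word}, as used in '{context}'"

def embed(text: str) -> list[float]:
    # In practice: call an embedding endpoint; this toy version
    # just maps the first few characters to numbers.
    return [ord(c) % 7 / 7 for c in text[:3]]

sentence = "the bank of the river"

# word -> contextual definition -> embedding of the definition
definitions = {w: define(w, sentence) for w in sentence.split()}
vectors = {w: embed(d) for w, d in definitions.items()}
```

The point is that “bank” embedded via its definition-in-context lands near river-related meanings rather than financial ones.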
