Extracting each word's embedding from an embedded sentence

I’m trying to do cross-lingual word alignment by calculating cosine similarity between the embeddings of each word. But contextual embedding means feeding the API the whole sentence rather than individual words isolated from their contexts, and what I get back is one long list of numbers that I have no idea how to segment into embeddings for each individual word.
I got this tutorial for getting word-level embeddings from sentence embeddings from ChatGPT:

1. Verify the Source of Embeddings:

  • Check Model Output: If you generated these embeddings using a model like BERT or another transformer-based model, ensure that you extracted the embeddings for each token (word) in the sentence, not just the entire sentence. Typically, transformer models output a sequence of embeddings, where each embedding corresponds to a token in the input.

2. Tokenize the Sentence:

  • Use the same tokenizer that was used when generating these embeddings. Most transformer-based models use sub-word tokenization (e.g., WordPiece, BPE). Tokenizing the sentence again will help you align the embeddings with the words or sub-words.
from transformers import BertTokenizer

# Tokenize with the same tokenizer the embeddings came from
sentence = "Your input sentence here"
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(sentence)  # sub-word tokens, e.g. ['your', 'input', 'sentence', 'here']
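
For reference, here is a minimal sketch of how the token-level vectors come directly out of the model, assuming bert-base-uncased and PyTorch (the variable names are just illustrative):

import torch
from transformers import BertTokenizer, BertModel

sentence = "Your input sentence here"
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode the whole sentence and run it through the model
inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape [1, num_tokens, 768]:
# one 768-dimensional vector per token, including [CLS] and [SEP]
token_embeddings = outputs.last_hidden_state[0]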

3. Segment the Embeddings:

  • If the embeddings list you have corresponds to word-level tokens, then each word or sub-word’s embedding is likely a fixed-length vector. Suppose you used BERT with a hidden size of 768; each token’s embedding would be of length 768.

  • If you have, say, 10 words, then the embeddings list should ideally have a length of 10 x 768.

4. Splitting the List into Word Embeddings:

  • Assuming you have the embeddings list and know the vector size (e.g., 768 for BERT), you can split the embeddings list into chunks of the vector size.
# Example vector size for each word embedding
embedding_size = 768

# Example total number of tokens
num_tokens = len(tokens)

# Assuming 'embeddings' is a flat list containing all embeddings
word_embeddings = [embeddings[i * embedding_size:(i + 1) * embedding_size] for i in range(num_tokens)]
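
Equivalently, a quick length check plus a numpy reshape does the same splitting in one step (a sketch, assuming numpy and the variables defined above):

import numpy as np

# Sanity check: the flat list should hold exactly one 768-length vector per token
assert len(embeddings) == num_tokens * embedding_size

# Reshape into a (num_tokens, 768) matrix, one row per token
word_embeddings = np.array(embeddings).reshape(num_tokens, embedding_size)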

5. Mapping to Words:

  • Now you can map each word to its corresponding embedding.

# Note: if the same token appears more than once, later occurrences overwrite earlier ones in a dict
word_to_embedding = {token: word_embeddings[i] for i, token in enumerate(tokens)}
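
One caveat worth adding: WordPiece can split a single word into several sub-word tokens, so a dict keyed by token string will not give you one entry per original word. If you need one vector per word, a fast tokenizer’s word_ids() lets you group and pool the sub-word vectors. A sketch, assuming the sentence and the token_embeddings tensor from the model sketch above:

from collections import defaultdict
from transformers import BertTokenizerFast

# A fast tokenizer produces the same WordPiece tokens but also exposes word_ids(),
# which maps each token position back to the original word it came from
fast_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
encoding = fast_tokenizer(sentence)
word_ids = encoding.word_ids()  # e.g. [None, 0, 1, 2, 2, ..., None]; None marks [CLS]/[SEP]

# Group token positions by the word they belong to
groups = defaultdict(list)
for token_index, word_index in enumerate(word_ids):
    if word_index is not None:
        groups[word_index].append(token_index)

# Average the sub-word vectors of each word into a single vector
word_vectors = []
for word_index, token_positions in sorted(groups.items()):
    span = encoding.word_to_chars(word_index)  # character span of this word in the sentence
    word_text = sentence[span.start:span.end]
    word_vectors.append((word_text, token_embeddings[token_positions].mean(dim=0)))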

Welcome to the community!

What you have is an embedding vector: it encodes the semantic meaning of your entire text, as a whole.

You can, in theory (in the sense that there’s nothing stopping you), send individual words to the endpoint to achieve your goal. Each word will also return a gigantic vector of the same dimension.

You then just compute the cosine similarity of each pair of vectors you want to compare.
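
For example, something roughly like this with the OpenAI Python client (the model name and the word pair are just placeholders for illustration):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-3-small"):
    # One embedding vector per input string, returned in the same order
    response = client.embeddings.create(model=model, input=texts)
    return [np.array(item.embedding) for item in response.data]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

source_vec, target_vec = embed(["dog", "chien"])
print(cosine_similarity(source_vec, target_vec))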

Does that make sense?


I noted that in theory, there’s nothing stopping you. In practice, these text embedding models aren’t really built for comparing individual words - partly because individual words can be very context sensitive and mean different things depending on how or where they’re used. However, using the embedding model this way might get you 99% of the way to where you want to go, and it might even be more than good enough. So I definitely do encourage you to try it.

To improve it, if the budget and use case allow, you could consider using a generative LLM to translate your text into a structured list of contextual definitions, which you can then embed. Embeddings of definitions will generally capture more meaning than embeddings of the words alone.
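
Roughly along these lines, as a sketch (the model choice and prompt wording are just placeholders):

from openai import OpenAI

client = OpenAI()

def contextual_definition(word, sentence, model="gpt-4o-mini"):
    # Ask a generative model for a short definition of the word as it is used in this sentence
    prompt = (
        f'In the sentence: "{sentence}"\n'
        f'give a one-sentence definition of the word "{word}" as it is used here.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

definition = contextual_definition("bank", "She sat on the bank of the river.")
# Embed the definition instead of the bare word, then compare with cosine similarity as before
definition_vec = client.embeddings.create(
    model="text-embedding-3-small", input=[definition]
).data[0].embedding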


Hi, Diet
Thank you for the advice to use a generative LLM to give each word a contextual definition! That’s ingenious! I hadn’t thought of that. I’m going to try using gpt-4o-mini to do that to save some money :grinning:



The results are quite good! There are still some edge cases that need to be solved. Thanks again for the suggestion!
