I’m trying to do cross-lingual word alignment by computing the cosine similarity between word embeddings. But contextual embeddings require feeding the API the whole sentence rather than individual words isolated from their contexts. What I get back, however, is one long list of numbers, and I have no idea how to segment it to get an embedding for each individual word.
I got this tutorial from ChatGPT for getting word-level embeddings from sentence embeddings:
1. Verify the Source of Embeddings:
- Check Model Output: If you generated these embeddings using a model like BERT or another transformer-based model, ensure that you extracted the embeddings for each token (word) in the sentence, not just the entire sentence. Typically, transformer models output a sequence of embeddings, where each embedding corresponds to a token in the input.
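A minimal sketch of what that per-token output looks like with the Hugging Face transformers library (assuming bert-base-uncased; the variable names are only illustrative):
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

sentence = "Your input sentence here"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token, including [CLS] and [SEP]
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)  # (number_of_tokens, 768)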
2. Tokenize the Sentence:
- Use the same tokenizer that was used when generating these embeddings. Most transformer-based models use sub-word tokenization (e.g., WordPiece, BPE). Tokenizing the sentence again will help you align the embeddings with the words or sub-words.
from transformers import BertTokenizer
sentence = "Your input sentence here"
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(sentence)
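Note that tokenizer.tokenize() does not add the special tokens ([CLS], [SEP]) that BERT’s encoded input normally includes, so the token count can come out two short of the number of embeddings the model actually produced. A small sketch of recovering the token sequence exactly as the model saw it (assuming the embeddings came from a standard encode-and-forward pass):
# Tokens as the model saw them, including [CLS] and [SEP]
input_ids = tokenizer(sentence)["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(tokens)  # e.g. ['[CLS]', 'your', 'input', 'sentence', 'here', '[SEP]']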
3. Segment the Embeddings:
- If the embeddings list you have corresponds to word-level tokens, then each word or sub-word’s embedding is likely a fixed-length vector. Suppose you used BERT with a hidden size of 768; each token’s embedding would be of length 768.
- If you have, say, 10 tokens, then the flat embeddings list should ideally have a length of 10 x 768 = 7680.
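Before splitting anything, it is worth a quick sanity check that the flat list’s length really is a multiple of the hidden size (a sketch, assuming embeddings is the flat list you received):
embedding_size = 768  # hidden size of bert-base-uncased; check your model's config
assert len(embeddings) % embedding_size == 0, "length is not a multiple of the hidden size"
print(len(embeddings) // embedding_size)  # should equal the number of tokens, not the number of words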
4. Splitting the List into Word Embeddings:
- Assuming you have the embeddings list and know the vector size (e.g., 768 for BERT), you can split the embeddings list into chunks of the vector size.
# Example vector size for each word embedding
embedding_size = 768
# Example total number of tokens
num_tokens = len(tokens)
# Assuming 'embeddings' is a flat list containing all embeddings
word_embeddings = [embeddings[i * embedding_size:(i + 1) * embedding_size] for i in range(num_tokens)]
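Equivalently, if numpy is available, the same split is a one-line reshape (just an alternative sketch; it will also raise an error if the length does not divide evenly):
import numpy as np

# Rows are tokens, columns are embedding dimensions; row i belongs to tokens[i]
word_embeddings = np.asarray(embeddings).reshape(num_tokens, embedding_size)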
5. Mapping to Words:
- Now you can map each word to its corresponding embedding.
word_to_embedding = {token: word_embeddings[i] for i, token in enumerate(tokens)}
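Two caveats for the word-alignment use case: a dict keyed by token will silently overwrite repeated tokens, and sub-word tokenization means a single word can span several embeddings. Below is a rough sketch of pooling the sub-word vectors of each word into one word-level vector by averaging; it assumes a fast tokenizer (BertTokenizerFast, whose word_ids() method maps each token back to its source word), and the mean-pooling choice is an assumption here, not part of the tutorial.
import torch
from collections import defaultdict
from transformers import BertTokenizerFast, BertModel

# word_ids() is only available on fast tokenizers, so use BertTokenizerFast here
fast_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

encoded = fast_tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state[0]

tokens = fast_tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist())

# Group sub-word token positions by the word they came from ([CLS]/[SEP] map to None)
pieces_per_word = defaultdict(list)
for position, word_id in enumerate(encoded.word_ids()):
    if word_id is not None:
        pieces_per_word[word_id].append(position)

# Average the sub-word vectors of each word into a single word-level embedding
word_level_embeddings = []
for word_id, positions in sorted(pieces_per_word.items()):
    word_text = "".join(tokens[p].lstrip("#") for p in positions)  # rough surface form
    word_level_embeddings.append((word_text, token_embeddings[positions].mean(dim=0)))
A list of (word, vector) pairs also keeps duplicate words distinct, which matters when you later compare positions across the two languages.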