I am comparing embeddings across multiple languages, and I would like the cosine similarity between the same two texts to be consistent across languages. For example, I would like “dog” and “cat” to have approximately the same cosine similarity in English as their translations do in another language.
Currently I observe that cosine similarities between texts in other languages tend to be systematically higher than the similarities between the corresponding English texts.
Are there any common approaches to achieve the above?
The two options I see are:
- Normalize the cosine similarities by subtracting the per-language mean similarity (first sketch below)
- Learn a translation matrix as described in “Exploiting Similarities among Languages for Machine Translation” (Mikolov et al., 2013; second sketch below)
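For option 1, here is a minimal sketch of what I have in mind. The `cosine` helper and the random-pair estimate of the language mean are my own placeholders; `embeddings` stands in for whatever set of vocabulary vectors I sample from one language:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_similarity(embeddings, n_pairs=10_000, seed=0):
    # Estimate the language's mean pairwise cosine similarity from random
    # pairs of vocabulary vectors (computing all pairs exactly is O(V^2)).
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(embeddings), size=(n_pairs, 2))
    return float(np.mean([cosine(embeddings[i], embeddings[j]) for i, j in idx]))

def centered_similarity(a, b, language_mean):
    # Subtract the language-wide mean so that scores from different
    # languages sit on a comparable scale.
    return cosine(a, b) - language_mean

# e.g. centered_similarity(v_dog, v_cat, mean_similarity(english_vectors))
```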
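For option 2, my understanding is that the paper fits a linear map W minimizing ||XW − Z||² over a bilingual seed dictionary, where row i of X and row i of Z embed the same concept in the source and target language. The paper optimizes this with SGD; the closed-form least-squares solution below should give the same objective's minimizer. X and Z here are placeholders for those aligned embedding matrices:

```python
import numpy as np

def fit_translation_matrix(X, Z):
    # Solve min_W ||X @ W - Z||_F^2 in closed form.
    # X: (n, d_src) source embeddings; Z: (n, d_tgt) target embeddings,
    # aligned row-by-row via a bilingual seed dictionary.
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

def translate(x, W):
    # Map a source-language embedding into the target space; cosine
    # similarities can then be computed consistently in that one space.
    return x @ W
```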
However, I am relatively new to this, so any advice is welcome.
Thanks!