Maintaining Cosine Similarity Across Languages

I am comparing embeddings in multiple languages and I would like to maintain the cosine similarity between the same two texts across languages. For example I would like “dog” and “cat” to have approximately the same cosine similarity in English as “dog” and “cat” in another language.

Currently I observe that cosine similarities in another language tend to have higher dot products than cosine similarities in English.

Are there any common approaches to achieve the above?

The two options I see are:

  • Normalize cosine similarity by subtracting the mean for each language
  • Learn a translation matrix as specified in “Exploiting Similarities among Languages for Machine Translation”

However I am relatively new at this so any advice is welcome.


You’re unlikely to find much success in this.

It’s very doubtful the difference in cosine similarities between cat and dog in language 1 and language 2 will be the same as the difference in cosine similarity between cheese and milk in those two languages.

Do you have any particular reason to want to do this? I don’t really see what you hope to accomplish by doing this.