Embedding testing with different models

I was comparing Cohere, PaLM, and OpenAI embeddings and observed something peculiar that I can't explain. I ran two tests. In the first, I took three contexts and their corresponding questions from SQuAD (the Stanford Question Answering Dataset), created embeddings for each context and question, and computed the cosine similarity with every model. embed_multilingual (Cohere) and embed_ada (OpenAI) gave very good similarity scores, which I was glad to see. In the second test, I shuffled the contexts and questions so that the questions no longer related to the contexts. Even then, embed_multilingual and embed_ada gave very high similarity scores, which looked very strange to me. My Kaggle notebook is here: text_embedding_comp | Kaggle
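For readers following along, the two tests can be sketched roughly as below. This is not the code from the notebook: the function names are made up for illustration, and the vectors are synthetic stand-ins (a large shared offset plus a per-topic component) rather than real API embeddings. The point is only to show how matched and mismatched pairs can both score high when all vectors share a common offset.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    # Normalize rows to unit length, then take all pairwise dot products.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def matched_vs_shuffled(context_vecs, question_vecs):
    # Row i of each array is assumed to belong to the same SQuAD item.
    sims = cosine_similarity_matrix(context_vecs, question_vecs)
    n = sims.shape[0]
    matched = np.diag(sims)                     # question i vs its own context
    mismatched = sims[~np.eye(n, dtype=bool)]   # question i vs other contexts
    return matched.mean(), mismatched.mean()

# Synthetic stand-in for real embeddings (assumption, not model output):
# every vector shares a large offset, which mimics a concentrated space.
rng = np.random.default_rng(0)
offset = 3.0 * rng.normal(size=64)
topics = rng.normal(size=(3, 64))
contexts = offset + topics + 0.3 * rng.normal(size=(3, 64))
questions = offset + topics + 0.3 * rng.normal(size=(3, 64))

matched_mean, mismatched_mean = matched_vs_shuffled(contexts, questions)
print(f"matched: {matched_mean:.3f}  mismatched: {mismatched_mean:.3f}")
```

With real embeddings you would replace `contexts` and `questions` with the arrays returned by each model and compare the two means per model.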

You really can’t compare similarity scores across different models because every model has a certain amount of vector bias and vector concentration.

For ada-002, this is easy to demonstrate: embed two random strings and notice that the cosine similarity never goes below 0.7. This “concentration” of the embedding space makes it “non-isotropic”. You can make the space more isotropic (more spread out) by using principal component analysis and bias removal. I have a post on this, in this forum, if you are interested.
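For anyone who wants to try this, here is a minimal sketch of one common way to do it: subtract the mean vector (the bias) and project out the top principal direction(s), both fitted on a pile of existing embeddings. This is not necessarily the exact method from the post mentioned above. The function names, the `n_components` parameter, and the synthetic data standing in for real embeddings are all assumptions for illustration.

```python
import numpy as np

def fit_isotropy_transform(embeddings, n_components=1):
    # Learn the bias (mean vector) and the dominant principal directions.
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def apply_isotropy_transform(embeddings, mean, top_directions):
    # Remove the bias, project out the dominant directions, re-normalize.
    centered = embeddings - mean
    cleaned = centered - (centered @ top_directions.T) @ top_directions
    return cleaned / np.linalg.norm(cleaned, axis=1, keepdims=True)

def mean_pairwise_cosine(vecs):
    # Average cosine similarity over all distinct pairs of rows.
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    return (sims.sum() - n) / (n * (n - 1))

# Synthetic stand-in for a batch of embeddings with a strong shared bias
# (assumption: real model output would be loaded here instead).
rng = np.random.default_rng(0)
bias = 3.0 * rng.normal(size=64)
raw = bias + rng.normal(size=(200, 64))

mean, top = fit_isotropy_transform(raw, n_components=1)
cleaned = apply_isotropy_transform(raw, mean, top)

before = mean_pairwise_cosine(raw)
after = mean_pairwise_cosine(cleaned)
print(f"mean pairwise cosine before: {before:.3f}, after: {after:.3f}")
```

New embeddings should be transformed with the same fitted `mean` and `top` so that everything lives in the same adjusted space.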

Unfortunately, you need to post-fit the data (so you already need a pile of embeddings to “learn” from). In the end, it may not be worth it, and you can instead just adjust your expectations of how much cosine similarity varies for each model you are dealing with.
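If you go the "adjust your expectations" route, one rough way to do that is to measure, per model, how similar unrelated texts look, and then read each raw score relative to that baseline. The sketch below is an assumption about how you might set this up, not an established recipe: `baseline_stats` and `standardized_score` are hypothetical helper names, and the vectors are synthetic stand-ins for embeddings of unrelated texts.

```python
import numpy as np

def baseline_stats(unrelated_vecs):
    # Mean and standard deviation of cosine similarity over all distinct
    # pairs of embeddings that are assumed to be unrelated to each other.
    unit = unrelated_vecs / np.linalg.norm(unrelated_vecs, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(unit), k=1)  # each pair once, no diagonal
    pair_sims = sims[upper]
    return pair_sims.mean(), pair_sims.std()

def standardized_score(raw_similarity, base_mean, base_std):
    # How many baseline standard deviations a raw score sits above
    # what this model gives for unrelated text.
    return (raw_similarity - base_mean) / base_std

# Synthetic stand-in for embeddings of unrelated texts from one model
# (assumption: a shared offset mimics a concentrated embedding space).
rng = np.random.default_rng(0)
unrelated = 3.0 * rng.normal(size=64) + rng.normal(size=(100, 64))

base_mean, base_std = baseline_stats(unrelated)
print(f"baseline mean: {base_mean:.3f}, baseline std: {base_std:.3f}")
print(f"raw 0.95 -> standardized {standardized_score(0.95, base_mean, base_std):.1f}")
```

Standardized scores are still not strictly comparable across models, but they make it clearer whether a "high" raw score is actually above that model's floor.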

But you can fix it, somewhat, to give the model reasonable geometry. This could be useful in things like word/sentence analogy search using geometric techniques, or other vector comparisons that involve optimizing some geometric property over the space.
