Semantic Textual Similarity - undifferentiated similarities

Hi!
Do you have any idea why the cosine similarities for embeddings generated by text-embedding-ada-002 are so undifferentiated? What I mean is that when I compute pairwise cosine similarities for a set of sentences, the standard deviation of the scores from text-embedding-ada-002 is much smaller than for models from other vendors.

Let’s look at this example:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cosine similarities are as follows:

The cat sits outside 		 The dog plays in the garden 		 Score: 0.2838
A man is playing guitar 		 A woman watches TV 		 Score: -0.0327
The new movie is awesome 		 The new movie is so great 		 Score: 0.8939

And the same for the OpenAI model:

import openai

# Fetch the embedding for a single piece of text
# (uses the pre-1.0 openai library interface)
def get_open_ai_embedding(text):
    answer = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return answer["data"][0]["embedding"]

# Compute embeddings for both lists
embeddings1 = [get_open_ai_embedding(sentence) for sentence in sentences1]
embeddings2 = [get_open_ai_embedding(sentence) for sentence in sentences2]

#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

Output:

The cat sits outside 		 The dog plays in the garden 		 Score: 0.8671
A man is playing guitar 		 A woman watches TV 		 Score: 0.7827
The new movie is awesome 		 The new movie is so great 		 Score: 0.9695
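
To quantify how much less differentiated the ada scores are, here is a minimal sketch that computes the mean and standard deviation of the three diagonal scores from each model (the values are copied straight from the two outputs above):

import torch

# Diagonal scores copied from the two outputs above
minilm_scores = torch.tensor([0.2838, -0.0327, 0.8939])
ada_scores = torch.tensor([0.8671, 0.7827, 0.9695])

for name, scores in [("all-MiniLM-L6-v2", minilm_scores),
                     ("text-embedding-ada-002", ada_scores)]:
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")

On these three pairs the MiniLM scores have a standard deviation of about 0.47, versus about 0.09 for ada, so the spread really is roughly five times smaller.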

I found a similar topic, but in my opinion the answer to the question I asked is not there.

Hi,

Is there a particular issue you are facing with the difference in the numbers being smaller? The results from ada are indeed clustered closer together and in a smaller range, but the performance seems to be great.


Yes, I agree that performance is good, but I am concerned about the interpretability of the embeddings. If the similarity scores are so close together, how can we confidently distinguish between different pieces of text? It seems to me that this could be a problem when we want to distinguish between texts that differ only slightly in meaning. Would the model be able to pick up these nuances if the embeddings are so closely clustered?
For applications such as anomaly detection or highly detailed semantic analysis, where fine distinctions are crucial, a smaller standard deviation may not be ideal.
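
To illustrate the concern with a concrete (and admittedly arbitrary) threshold of 0.5: with the MiniLM scores above it cleanly separates the paraphrase pair from the unrelated ones, while with the ada scores every pair passes:

# Diagonal scores copied from the outputs above
minilm_scores = [0.2838, -0.0327, 0.8939]
ada_scores = [0.8671, 0.7827, 0.9695]

threshold = 0.5
print([score > threshold for score in minilm_scores])  # [False, False, True]
print([score > threshold for score in ada_scores])     # [True, True, True]

Any threshold that separates the ada pairs would have to sit somewhere inside the narrow 0.78-0.97 band, which is what makes the scores hard to interpret in isolation.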

I'm also asking out of curiosity 🙂

Well, if the scores for two semantically opposite strings were 0.1 and 0.9, we would clearly see a large differential at that scale. Now, if the scores are only 0.0001 and 0.0002, they seem very similar, but at the scale of 0.00001 one is still twice the other. If the differences were so small that they hit the limit of floating-point resolution, there would be a problem, but as long as they are far from that limit, the differential being small is only a matter of perspective.
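
As a quick sanity check on the floating-point argument, here is a sketch using the ada scores from earlier in the thread: the gaps between scores are around 1e-1, while float32 resolution near 1.0 is around 1e-7, so there are roughly six orders of magnitude of headroom. And because any monotonic rescaling preserves ranking, the scores can always be spread out for readability:

import numpy as np

ada_scores = np.array([0.8671, 0.7827, 0.9695], dtype=np.float32)

# Gaps between neighbouring scores vs. float32 resolution near 1.0
print(np.diff(np.sort(ada_scores)))  # [0.0844, 0.1024]
print(np.finfo(np.float32).eps)      # ~1.19e-07

# A monotonic min-max rescale spreads the scores without changing
# their ranking -- the "small" differential really is just perspective
rescaled = (ada_scores - ada_scores.min()) / (ada_scores.max() - ada_scores.min())
print(rescaled)                      # [0.4518, 0.0, 1.0]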