Hi!
Do you have any idea why the cosine similarities of embeddings generated by text-embedding-ada-002 are so undifferentiated? What I mean is that when I compute the cosine distances for a set of sentences, their standard deviation for the text-embedding-ada-002 model is much smaller than for models from other vendors.
Let’s look at this example:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output the pairs with their scores
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
The cosine similarities are as follows:
The cat sits outside The dog plays in the garden Score: 0.2838
A man is playing guitar A woman watches TV Score: -0.0327
The new movie is awesome The new movie is so great Score: 0.8939
And the same for the OpenAI model:
import openai

# Reuses util, sentences1 and sentences2 from the snippet above.
# Note: openai.Embedding.create is the pre-1.0 openai-python interface.

# Compute embeddings for both lists
def get_open_ai_embeddings(text):
    answer = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return answer["data"][0]["embedding"]

embeddings1 = [get_open_ai_embeddings(sentence) for sentence in sentences1]
embeddings2 = [get_open_ai_embeddings(sentence) for sentence in sentences2]

# Compute cosine similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output the pairs with their scores
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
Output:
The cat sits outside The dog plays in the garden Score: 0.8671
A man is playing guitar A woman watches TV Score: 0.7827
The new movie is awesome The new movie is so great Score: 0.9695
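To put a number on what I mean by "undifferentiated", here is a minimal sketch that computes the spread of the three scores reported above for each model. It uses only the Python standard library and the numbers already shown; the variable names are my own, and three sentence pairs is of course far too small a sample to draw firm conclusions from.

```python
import statistics

# Scores copied from the two outputs above (same sentence pairs, same order).
minilm_scores = [0.2838, -0.0327, 0.8939]  # all-MiniLM-L6-v2
ada_scores = [0.8671, 0.7827, 0.9695]      # text-embedding-ada-002

# Population standard deviation of the similarity scores for each model.
# The spread of the cosine distances (1 - similarity) is identical.
minilm_spread = statistics.pstdev(minilm_scores)
ada_spread = statistics.pstdev(ada_scores)

# Range (max - min) as a second, cruder measure of how spread out the scores are.
minilm_range = max(minilm_scores) - min(minilm_scores)
ada_range = max(ada_scores) - min(ada_scores)

print("all-MiniLM-L6-v2        std: {:.4f}  range: {:.4f}".format(minilm_spread, minilm_range))
print("text-embedding-ada-002  std: {:.4f}  range: {:.4f}".format(ada_spread, ada_range))
```

On these three pairs the ada-002 scores all sit in a narrow band near the top of the scale, while the MiniLM scores cover most of the range from about 0 to 0.9.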