Embeddings and Cosine Similarity

lenwhite6094 · May 6, 2022, 3:12am

Given two documents:

Document 1: “Nothing.” (that is, the document consists of the word “Nothing” followed by a period.)

Document 2: “I love ETB and I feel that people in Europe are much better informed about our strategy than in other regions.”

I retrieved embeddings from three models and calculated the cosine similarity for each case:

ada: 0.5928075
babbage: 0.5404018
curie: 0.5653236

My question: is it possible for someone to verify that my cosine similarities are correct?

Assuming my cosine similarity calculations are correct, I would like to understand why it is that I always get values greater than .5. As I understand, cosine similarity has a range of -1 to 1.

Any guidance would be appreciated.

daveshapautomator · May 6, 2022, 12:42pm

You should post your code

lenwhite6094 · May 6, 2022, 1:42pm

Here you go, thanks.

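using System;

class Embedding
{
    // Cosine similarity: dot(v1, v2) / (|v1| * |v2|).
    public static float Similarity(float[] v1, float[] v2)
    {
        float dotProduct = DotProduct(v1, v2);
        float magV1 = Magnitude(v1);
        float magV2 = Magnitude(v2);
        return dotProduct / (magV1 * magV2);
    }

    // Sum of element-wise products; both vectors must have the same length.
    public static float DotProduct(float[] v1, float[] v2)
    {
        if (v1.Length != v2.Length)
            throw new ArgumentException("Vectors must have the same length.");
        float val = 0;
        for (int i = 0; i < v1.Length; i++)
            val += v1[i] * v2[i];
        return val;
    }

    // Euclidean (L2) norm. Math.Sqrt returns a double, so cast back to float.
    public static float Magnitude(float[] v)
    {
        return (float)Math.Sqrt(DotProduct(v, v));
    }
}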

bram · May 6, 2022, 10:18pm

In my experience, cosine similarity never really reaches 1.0. As such, it makes it hard to “verify” where the similarity lies in the dimensions. If you compare more items, a ranking will begin to emerge, even if slight.

lenwhite6094 · May 6, 2022, 11:15pm

Thanks for your comment, but I get 1.0 when I should. I posted this question here because I never get anything less than .5 and I want to figure out why.

vaibhav.garg · May 9, 2022, 10:29am

I have the same experience: the cosine similarity is seldom less than 0.5. We use this metric all over the place in our application, BookMapp, and I have millions of pairs of embeddings. I don’t see values much smaller than 0.3.

I just calculated the cosine similarity metrics of about 5,000,000 pairs of texts. The lowest value I got was

0.3228867136157207

Here are the smallest 30.

This is for Babbage search embeddings. I would not be able to share the exact texts with you, unfortunately.

vaibhav.garg · May 9, 2022, 10:37am

Please note that all of this text is modern English, so the texts may not be as dissimilar as you might think. The embeddings may also capture aspects of language, grammar and structure, not just meaning.

vaibhav.garg · May 9, 2022, 10:49am

Here is a pair of texts for you to try:

Cosine similarity

0.36690756948666003

text 1

Let’s Take a Selfie-Ms. Idea Robber.

text 2

Collaborative Co-author-GPT-3 is the third-generation language prediction model in the GPT-n series created by OpenAI, a San Francisco-based artificial intelligence research laboratory. GPT-3 is very powerful, albeit with some striking limitations as well. The power comes from (a) 175 billion parameters, (b) training over a large portion of web pages from the internet, a giant collection of books, and all of Wikipedia and (c) tasks capability that include text classification, e.g., sentiment analysis, question answering, text generation/summarization, named-entity recognition and language translation. The limitations include (a) lack of long-term memory, (b) Lack of interpretability, (c) Limited input size, (d) Slow inference time, and (e) Suffers from some bias, already.

lenwhite6094 · May 9, 2022, 2:08pm

Thank you very much for sharing your experience. That’s good to know that my experience isn’t unusual. I assume you’re not bothered by the absence of negative numbers.

For your text 1 and text 2 sample, I got these:

ada: 0.609392464
babbage: 0.518622339
curie: 0.5616422
davinci: 0.5140899

So the question would be why my babbage number is so different from yours. Looking at my code elsewhere in this thread, do you see where I’m miscalculating the cosine similarity?

vaibhav.garg · May 10, 2022, 1:23am

The ones I tried were search embeddings; you might be using the similarity embeddings. That’s one difference I can see.

Also, since the embeddings are already normalised (their L2 norm is 1), you do not need to divide by the norms. Doing so doesn’t hurt, but it doesn’t help either in this case.
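To illustrate the normalisation point, here is a minimal sketch in plain Python. The vectors `u` and `v` are made-up stand-ins for embeddings (not values returned by the API), and the helper names are only for illustration. It checks that, once the L2 norm is 1, the dot product and the full cosine formula give the same number.

```python
import math

def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

def l2_norm(v):
    # Euclidean length of the vector.
    return math.sqrt(dot(v, v))

def cosine(a, b):
    # Full cosine similarity, including the division by both norms.
    return dot(a, b) / (l2_norm(a) * l2_norm(b))

def normalize(v):
    # Scale a vector to unit L2 norm.
    n = l2_norm(v)
    return [x / n for x in v]

# Hypothetical toy vectors standing in for two embeddings.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

print(l2_norm(u), l2_norm(v))    # both approximately 1
print(dot(u, v), cosine(u, v))   # the two values agree up to rounding
```

If the norms printed for real embeddings are not close to 1, the division is still needed.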

I am bothered by the reduction in resolution, but we have to work with what we have.

lenwhite6094 · May 10, 2022, 3:00am

Ah, thanks! Now that I’m getting a little more comfortable, I’m seeing that I missed some things in the documentation, one of them being search vs similarity embeddings. And thanks for the tip on the normalization.

vaibhav.garg · May 10, 2022, 3:59am

The lowest I got with similarity embeddings was 0.43.

vaibhav.garg · May 10, 2022, 4:02am

I would rather thank you for getting me to restart looking into this quirk that bothered me a few months ago, and I failed to pursue at that time due to being focused elsewhere.

I will get back if I can find something.

Oh, and BTW, I got the same results with bert-uncased (only positive similarities), but it has been a while.

vaibhav.garg · May 10, 2022, 4:24am

Here’s what I found. (Your question caught my fancy.)

a. The histogram of the cosine similarities I have looks like this

b. I found a great explanation here. Here is the line that clinches it, based on the author’s empirical experimentation, which I am trying to replicate:

In the case of max pooling, however, the cosine range is shrinked and moved drastically to positive values, as can be seen in the below graph. (sic)

Also,
The larger the embeddings, the more pronounced the effect is.
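A rough way to see the pooling effect is a toy experiment like the sketch below. It assumes random Gaussian "token" vectors and compares mean pooling with max pooling; it says nothing about how the OpenAI models actually pool, and all function names and parameters here are made up for the illustration.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def random_tokens(rng, n_tokens, dim):
    # A fake "document": n_tokens token vectors with N(0, 1) components.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]

def mean_pool(tokens):
    # Average each dimension over the tokens.
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def max_pool(tokens):
    # Take the largest value in each dimension over the tokens.
    return [max(col) for col in zip(*tokens)]

def sample_cosines(pool, n_pairs=200, n_tokens=10, dim=64, seed=0):
    # Cosine similarity between pooled vectors of independent random documents.
    rng = random.Random(seed)
    return [
        cosine(pool(random_tokens(rng, n_tokens, dim)),
               pool(random_tokens(rng, n_tokens, dim)))
        for _ in range(n_pairs)
    ]

mean_cos = sample_cosines(mean_pool)
max_cos = sample_cosines(max_pool)
print("mean pooling:", min(mean_cos), max(mean_cos))  # spread around zero
print("max pooling: ", min(max_cos), max(max_cos))    # all clearly positive
```

In this toy setup the max of several zero-mean values is positive on average, so every pooled component is biased upward and unrelated documents still end up with a positive cosine. Changing `dim` is one way to probe the remark about embedding size.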

lenwhite6094 · May 10, 2022, 12:39pm

OK, I’m glad I’m not alone in this! Thanks for the link, that is indeed interesting. It would be great if someone from OpenAI could jump in and share some insights of their own.

vaibhav.garg · May 19, 2022, 8:04am

I did some experiments and wrote them up here:
Why are Cosine Similarities of Text embeddings almost always positive? | by Vaibhav Garg | May, 2022 | Medium

jazzcript · May 19, 2022, 8:55am

Congrats, this is a truly enlightening text.

lenwhite6094 · May 19, 2022, 5:20pm

Nice work, very informative. Thanks for sharing!

chinmay1 · July 23, 2022, 4:13pm

We are using text search embeddings to find similar text. Does the API return a cosine similarity score, or are you calculating it on your own (and if so, how)?

vaibhav.garg · July 24, 2022, 2:55pm

We need to calculate them on our own. Just multiply the two embedding vectors element-wise and sum the products (that is, take the dot product). This works because the embeddings returned are normalised, i.e. their L2 norm is 1.
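For the "find similar text" use case, a minimal sketch could look like this. The query and candidate vectors are hypothetical unit-length toy values rather than real API output, and `rank_by_similarity` is just an illustrative helper name.

```python
def dot(a, b):
    # Element-wise multiply and sum; equals cosine similarity for unit vectors.
    return sum(x * y for x, y in zip(a, b))

def rank_by_similarity(query, candidates):
    # Score every candidate against the query and sort best-first.
    scored = [(dot(query, emb), name) for name, emb in candidates.items()]
    return sorted(scored, reverse=True)

# Hypothetical embeddings, already normalised to L2 norm 1.
query = [0.6, 0.8]
candidates = {
    "doc_a": [0.8, 0.6],
    "doc_b": [1.0, 0.0],
    "doc_c": [0.0, 1.0],
}

for score, name in rank_by_similarity(query, candidates):
    print(name, score)
```

With these toy values the order is doc_a, doc_c, doc_b. For millions of pairs, a vectorised library would be the more practical choice than a pure-Python loop.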
