Embeddings and Cosine Similarity

Given two documents:

Document 1: “Nothing.” (that is, the document consists of the word “Nothing” followed by a period.)

Document 2: “I love ETB and I feel that people in Europe are much better informed about our strategy than in other regions.”

I retrieved embeddings from three models and calculated the cosine similarity for each case:

ada: 0.5928075
babbage: 0.5404018
curie: 0.5653236

My question: is it possible for someone to verify that my cosine similarities are correct?

Assuming my calculations are correct, I would like to understand why I always get values greater than 0.5. As I understand it, cosine similarity ranges from -1 to 1.

Any guidance would be appreciated.

You should post your code.

Here you go, thanks.

using System;

class Embedding
{
    // Cosine similarity: dot(v1, v2) / (|v1| * |v2|).
    public static float Similarity(float[] v1, float[] v2)
    {
        float dotProduct = DotProduct(v1, v2);
        float magV1 = Magnitude(v1);
        float magV2 = Magnitude(v2);
        return dotProduct / (magV1 * magV2);
    }

    // Sum of element-wise products; both vectors must have the same length.
    public static float DotProduct(float[] v1, float[] v2)
    {
        float val = 0;
        for (int i = 0; i < v1.Length; i++)
            val += v1[i] * v2[i];
        return val;
    }

    // Euclidean (L2) norm of the vector.
    public static float Magnitude(float[] v)
    {
        // Math.Sqrt returns a double, so cast the result back to float.
        return (float)Math.Sqrt(DotProduct(v, v));
    }
}
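One way to sanity-check an implementation like the one above is to run it on vectors whose cosine similarity is known in advance. What follows is a minimal Python sketch, not the poster's C# code; the function name `cosine_similarity` and the test vectors are made up for illustration. Parallel, perpendicular, and anti-parallel vectors should give values of approximately 1, 0, and -1, which also confirms the -1 to 1 range mentioned in the question.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine similarity: dot(v1, v2) / (|v1| * |v2|)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


# Hand-checkable cases (illustrative vectors, not real embeddings).
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])         # parallel
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])             # perpendicular
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])  # anti-parallel

print(same, orthogonal, opposite)
```

If an implementation passes checks like these, values clustered above 0.5 on real embeddings are more likely a property of the embeddings than a bug in the arithmetic.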

In my experience, cosine similarity never really reaches 1.0. As such, it makes it hard to “verify” where the similarity lies in the dimensions. If you compare more items, a ranking will begin to emerge, even if slight.

Thanks for your comment, but I get 1.0 when I should. I posted this question here because I never get anything less than 0.5 and I want to figure out why.

I have the same experience: the cosine similarity is seldom less than 0.5. We use this metric all over the place in our application, BookMapp, and I have millions of pairs of embeddings. I don’t see values much smaller than 0.3.

I just calculated the cosine similarity metrics of about 5,000,000 pairs of texts. The lowest value I got was

0.3228867136157207

Here are the smallest 30.

[Image: the 30 smallest cosine similarity values]

This is for Babbage search embeddings. I would not be able to share the exact texts with you, unfortunately.

Please note that all this text is modern English language text, so may not be as dissimilar as you might think. The semantics may encapsulate the aspects of language, grammar and structure as well.

Here is a pair of texts for you to try:

Cosine similarity

0.36690756948666003

text 1

Let’s Take a Selfie-Ms. Idea Robber.


text 2

Collaborative Co-author-GPT-3 is the third-generation language prediction model in the GPT-n series created by OpenAI, a San Francisco-based artificial intelligence research laboratory. GPT-3 is very powerful, albeit with some striking limitations as well. The power comes from (a) 175 billion parameters, (b) training over a large portion of web pages from the internet, a giant collection of books, and all of Wikipedia and (c) tasks capability that include text classification, e.g., sentiment analysis, question answering, text generation/summarization, named-entity recognition and language translation. The limitations include (a) lack of long-term memory, (b) Lack of interpretability, (c) Limited input size, (d) Slow inference time, and (e) Suffers from some bias, already.

Thank you very much for sharing your experience. That’s good to know that my experience isn’t unusual. I assume you’re not bothered by the absence of negative numbers.

For your text 1 and text 2 sample, I got these:

ada: 0.609392464
babbage: 0.518622339
curie: 0.5616422
davinci: 0.5140899

So the question would be why my babbage number is so different from yours. Looking at my code elsewhere in this thread, do you see where I’m miscalculating the cosine similarity?

The ones I tried were search embeddings; you might be using the similarity embeddings. That’s one difference I can see.

Also, since the embeddings are already normalised (the Frobenius norm, which for a vector is the same as the L2 norm, is 1), you do not need to divide by the norms. Dividing doesn’t hurt, but it doesn’t help either in this case.
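The point about normalisation can be checked with a short Python sketch. The vectors here are made up and normalised by hand; whether a particular API returns unit-length embeddings is an assumption worth verifying by printing the norm of a real embedding. The helper names `l2_norm`, `normalize`, and `dot` are illustrative only.

```python
import math


def l2_norm(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a vector to unit length."""
    n = l2_norm(v)
    return [x / n for x in v]


def dot(v1, v2):
    """Sum of element-wise products."""
    return sum(a * b for a, b in zip(v1, v2))


# Made-up vectors standing in for embeddings (real ones have far more dimensions).
a = normalize([0.3, -1.2, 0.8, 2.0])
b = normalize([1.1, 0.4, -0.5, 0.9])

full_cosine = dot(a, b) / (l2_norm(a) * l2_norm(b))  # divides by norms that are ~1
dot_only = dot(a, b)                                 # skips the division

print(l2_norm(a), l2_norm(b))  # both very close to 1
print(full_cosine, dot_only)   # agree up to floating-point rounding
```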

I am bothered by the reduction in resolution, but we have to work with what we have 🤷

Ah, thanks! Now that I’m getting a little more comfortable, I’m seeing that I missed some things in the documentation, one of them being search vs similarity embeddings. And thanks for the tip on the normalization.

The lowest I got with similarity embeddings was 0.43.

I should be thanking you for getting me to look into this quirk again. It bothered me a few months ago, but I failed to pursue it at the time because I was focused elsewhere.

I will get back if I can find something.

Oh, and BTW, I got the same results with bert-uncased (only positive similarities), but it has been a while.

Here’s what I found (your question caught my fancy 🙂):

a. The histogram of the cosine similarities I have looks like this:
[Image: histogram of the cosine similarities]

b. I found a great explanation here. Here is the line that clinches it, based on the author’s empirical experiments, which I am trying to replicate:

In the case of max pooling, however, the cosine range is shrinked and moved drastically to positive values, as can be seen in the below graph. (sic)

Also,
The larger the embeddings, the more pronounced the effect is.
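The quoted max-pooling effect can be reproduced with a toy simulation. This Python sketch is not the linked author's experiment and says nothing about how any particular model pools its token vectors; it assumes random zero-mean Gaussian "token vectors", and all names and parameters (`random_tokens`, `average_cosine`, 20 tokens, 256 dimensions) are hypothetical. Max pooling makes every component positive on average, so two unrelated pooled vectors still point in a similar direction, while mean pooling keeps them roughly centred on zero.

```python
import math
import random


def cosine(v1, v2):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)


def random_tokens(rng, n_tokens, dim):
    """Zero-mean Gaussian vectors standing in for token embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]


def mean_pool(tokens):
    """Average each dimension over all tokens."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def max_pool(tokens):
    """Take the maximum of each dimension over all tokens."""
    return [max(col) for col in zip(*tokens)]


def average_cosine(pool, rng, pairs=50, n_tokens=20, dim=256):
    """Mean cosine similarity between independently generated pooled vectors."""
    total = 0.0
    for _ in range(pairs):
        a = pool(random_tokens(rng, n_tokens, dim))
        b = pool(random_tokens(rng, n_tokens, dim))
        total += cosine(a, b)
    return total / pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
mean_avg = average_cosine(mean_pool, rng)
max_avg = average_cosine(max_pool, rng)

print(f"mean pooling: {mean_avg:.3f}")  # close to 0
print(f"max pooling:  {max_avg:.3f}")   # strongly positive
```

In this toy setting the unrelated "documents" score near zero under mean pooling and well above 0.5 under max pooling, which is at least consistent with the narrow, positive range reported in this thread.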


OK, I’m glad I’m not alone in this! Thanks for the link, that is indeed interesting. It would be great if someone from OpenAI could jump in and share some insights of their own.

I did some experiments and wrote them up here:
Why are Cosine Similarities of Text embeddings almost always positive? | by Vaibhav Garg | May, 2022 | Medium


Congrats, this is a truly enlightening text. 👏

Nice work, very informative. Thanks for sharing!

We are using text search embeddings to find similar text. Does the API return a cosine similarity score, or are you calculating it on your own (and if so, how)?

We need to calculate them on our own. Just multiply the two vectors element-wise and sum the products, i.e. take the dot product. This works because the embeddings returned are normalised, i.e. their L2 norm is 1.
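For the search use case, the same dot product can be used to rank candidate texts against a query. This is a minimal Python sketch under the assumption stated above, that every vector is already unit length; the function `rank_by_similarity`, the document names, and the three-dimensional vectors are all made up for illustration and are not real embeddings.

```python
def dot(v1, v2):
    """Sum of element-wise products."""
    return sum(a * b for a, b in zip(v1, v2))


def rank_by_similarity(query, documents):
    """Return (name, score) pairs sorted from most to least similar.

    Assumes every vector is already unit length, so the dot product
    equals the cosine similarity.
    """
    scores = [(name, dot(query, vec)) for name, vec in documents.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up unit vectors standing in for a query and three document embeddings.
query = [0.6, 0.8, 0.0]
documents = {
    "doc_a": [0.8, 0.6, 0.0],
    "doc_b": [0.0, 0.0, 1.0],
    "doc_c": [0.6, 0.8, 0.0],
}

ranking = rank_by_similarity(query, documents)
for name, score in ranking:
    print(name, round(score, 3))
```

Because the scores in practice occupy a narrow positive band, the ordering is usually more informative than the absolute values.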