My question: is it possible for someone to verify that my cosine similarities are correct?
Assuming my cosine similarity calculations are correct, I would like to understand why I always get values greater than 0.5. As I understand it, cosine similarity has a range of -1 to 1.
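For reference, here is the kind of calculation I am trying to verify (a minimal sketch with made-up vectors rather than my actual code, just to pin down the formula and show the expected -1 to 1 range):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sanity checks showing that the formula itself spans the full -1 to 1 range
print(cosine_similarity([1, 0], [1, 0]))   # identical direction  ->  1.0
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal           ->  0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction   -> -1.0
```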
In my experience, cosine similarity never really reaches 1.0, which makes it hard to “verify” where the similarity lies across the dimensions. If you compare more items, a ranking will begin to emerge, even if the differences are slight.
Thanks for your comment, but I do get 1.0 when I should. I posted this question here because I never get anything less than 0.5, and I want to figure out why.
I have the same experience: the cosine similarity is seldom less than 0.5. We use this metric all over the place in our application, BookMapp, and I have millions of pairs of embeddings. I don’t see values much smaller than 0.3.
I just calculated the cosine similarities of about 5,000,000 pairs of texts. The lowest value I got was
0.3228867136157207
Here are the smallest 30.
This is for Babbage search embeddings. I would not be able to share the exact texts with you, unfortunately.
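I cannot share the texts, but this is roughly how the numbers are pulled once the embeddings are in hand (a sketch; `A` and `B` are placeholder names for two arrays holding the pre-computed pair embeddings, and the stand-in data here is random so the snippet runs on its own):

```python
import numpy as np

# A and B stand in for two (n_pairs, dim) arrays of pre-computed embeddings,
# where row i of A is paired with row i of B. Random data is used here only
# so that the sketch runs end to end.
A = np.random.randn(1000, 2048)
B = np.random.randn(1000, 2048)

# Normalise the rows so the row-wise dot product equals the cosine similarity.
A /= np.linalg.norm(A, axis=1, keepdims=True)
B /= np.linalg.norm(B, axis=1, keepdims=True)

# Cosine similarity of each pair (row-wise dot product).
sims = np.einsum("ij,ij->i", A, B)

# The 30 smallest similarities, in ascending order.
k = 30
smallest = np.sort(sims[np.argpartition(sims, k)[:k]])
print(smallest)
```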
Please note that all of this is modern English text, so the pairs may not be as dissimilar as you might think. The embeddings may capture shared aspects of language, grammar and structure as well as meaning.
Thank you very much for sharing your experience. It’s good to know that my experience isn’t unusual. I assume you’re not bothered by the absence of negative numbers.
So the question would be why my babbage number is so different from yours. Looking at my code elsewhere in this thread, do you see where I’m miscalculating the cosine similarity?
The ones I tried were search embeddings; you might be using the similarity embeddings. That’s one difference I can see.
Also, since the embeddings are already normalised (the Frobenius norm, i.e. the L2 norm of the vector, is 1), you do not need to divide by the norms. That doesn’t hurt, but it doesn’t help either in this case.
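You can check this yourself (a quick sketch; `a` and `b` are placeholders for two embedding vectors you already have, faked here as random unit vectors so the snippet is self-contained):

```python
import numpy as np

# a and b stand in for two embedding vectors returned by the API;
# random unit vectors are used here so the sketch runs on its own.
a = np.random.randn(2048); a /= np.linalg.norm(a)
b = np.random.randn(2048); b /= np.linalg.norm(b)

print(np.linalg.norm(a), np.linalg.norm(b))  # both ~1.0, as with the API embeddings

plain_dot = np.dot(a, b)                                              # no division by norms
full_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # full formula

print(plain_dot, full_cosine)  # identical (up to floating point) when the norms are 1
```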
I am bothered by the reduction in resolution, but we have to work with what we have.
Ah, thanks! Now that I’m getting a little more comfortable, I’m seeing that I missed some things in the documentation, one of them being search vs similarity embeddings. And thanks for the tip on the normalization.
I would rather thank you for getting me to look into this quirk again; it bothered me a few months ago, but I failed to pursue it at the time because I was focused elsewhere.
I will get back to you if I find something.
Oh, and BTW, I got the same results with bert-uncased (only positive similarities), but it has been a while.
Here’s what I found (your question caught my fancy):
a. The histogram of the cosine similarities I have looks like this (a sketch of the plotting code is included after the quotes below).
b. I found a great explanation here. Here is the line that clinches it, based on empirical experimentation by the author, which I am trying to replicate:
In the case of max pooling, however, the cosine range is shrinked and moved drastically to positive values, as can be seen in the below graph. (sic)
Also,
The larger the embeddings, the more pronounced the effect is.
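For what it’s worth, this is roughly how I produce the histogram above (a sketch; `sims` is a placeholder for the 1-D array of pair similarities computed earlier, filled with synthetic values here so the snippet runs, and matplotlib is assumed to be installed):

```python
import numpy as np
import matplotlib.pyplot as plt

# sims stands in for the 1-D array of cosine similarities computed earlier;
# synthetic values are used here purely so the sketch is self-contained.
sims = np.clip(np.random.normal(loc=0.65, scale=0.08, size=100_000), -1, 1)

plt.hist(sims, bins=100, range=(-1, 1))
plt.xlabel("cosine similarity")
plt.ylabel("number of pairs")
plt.title("Distribution of pairwise cosine similarities")
plt.show()
```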
OK, I’m glad I’m not alone in this! Thanks for the link, that is indeed interesting. It would be great if someone from OpenAI could jump in and share some insights of their own.
We need to calculate them on our own. Just multiply the two vectors element-wise and sum the products. This works because the embeddings returned are normalised, i.e. their L2 norm is 1.
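For example (a minimal sketch; the short made-up lists below stand in for the much longer embedding vectors the API returns, and are chosen to have unit L2 norm):

```python
# a and b stand in for two unit-norm embedding vectors returned by the API.
a = [0.6, 0.8, 0.0]
b = [0.8, 0.6, 0.0]

# Element-wise multiply and sum: this is the dot product, which equals the
# cosine similarity because both vectors have L2 norm 1.
similarity = sum(x * y for x, y in zip(a, b))
print(similarity)  # 0.96
```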