I’m encountering an issue where the cosine similarity between embedding vectors extracted using the text-embedding-ada-002 model consistently yields values above 0.68, even for dissimilar or unrelated texts.
For example, when I calculate the cosine similarity between the following two texts:
Recession is happening in the economy.
Tulips are spring-blooming perennial herbaceous bulbiferous geophytes in the Tulipa genus.
The cosine similarity between the above two texts is unexpectedly high at 0.7006508392972203, despite the texts being dissimilar in meaning.
I’m puzzled by this behavior and would appreciate any insights into why the cosine similarity is consistently high even for unrelated texts.
I also wonder whether I am interpreting the cosine similarity number correctly.
PS:
Please note that I have omitted the code that calculates cosine_similarity (it is very simple), as this inquiry concerns the conceptual understanding of cosine similarity between two embedding vectors.
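For reference, here is a minimal sketch of the kind of calculation I mean, assuming the openai Python client (v1+) and numpy; the helper names are mine, not from any library:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(text: str) -> np.ndarray:
    # text-embedding-ada-002 returns a 1536-dimensional vector
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # dot product divided by the product of L2 norms; ada-002 vectors are
    # already unit-normalized, so this effectively reduces to a dot product
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = embed("Recession is happening in the economy.")
b = embed("Tulips are spring-blooming perennial herbaceous bulbiferous geophytes in the Tulipa genus.")
print(cosine_similarity(a, b))  # ~0.70 in my runs, despite unrelated meanings
```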
I always treat cosine similarity values from text-ada-002 embeddings (1536 dims) as relative: a CS value above a chosen threshold indicates a semantic relationship, and a lower value implies the lack of one.
It’s been discussed a bunch; here is one example I found:
Some people normalize it. What I do is adjust my thresholds. Usually anything above 0.9 is correlated, anything less than 0.8 is uncorrelated, and anything between 0.8 and 0.9 is the grey zone.
But these are rough values, and you should adjust from here given your observations on your own data set.
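For illustration, those rough cutoffs could be applied like this; the function name and the bucket labels are just a sketch of the rule of thumb above, not a fixed recipe:

```python
def classify_pair(cs: float, low: float = 0.8, high: float = 0.9) -> str:
    # Buckets a cosine similarity score using the rough thresholds quoted above.
    if cs >= high:
        return "correlated"
    if cs < low:
        return "uncorrelated"
    return "grey zone"


print(classify_pair(0.7006508392972203))  # -> "uncorrelated" under these cutoffs
```

Note that under these cutoffs the 0.70 score from the original question already lands in the "uncorrelated" bucket.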
One of the best ways, as @curt.kennedy said, is to simply test embeddings with your data and observe how the CS values correlate with relevance. Then pick a value slightly lower than the lowest CS at which you observe relevance.
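A minimal sketch of that calibration step, assuming you already have (CS, relevance) pairs labeled from your own data; the helper name and the margin value are hypothetical:

```python
def pick_threshold(scored_pairs: list[tuple[float, bool]], margin: float = 0.01) -> float:
    # scored_pairs: (cosine_similarity, is_relevant) for labeled examples.
    relevant_scores = [cs for cs, relevant in scored_pairs if relevant]
    # Threshold slightly below the lowest CS observed for a relevant pair.
    return min(relevant_scores) - margin


# Toy labeled data: two relevant pairs, two irrelevant ones.
pairs = [(0.93, True), (0.88, True), (0.71, False), (0.69, False)]
print(pick_threshold(pairs))  # 0.87 for this toy data
```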