Why cosine_similarity between embedding vectors is always above 0.68

I’m encountering an issue where the cosine similarity between embedding vectors extracted using the text-embedding-ada-002 model consistently yields values above 0.68, even for dissimilar or unrelated texts.

For example, when I calculate the cosine similarity between the following two texts:

  • Recession is happening in the economy.
  • Tulips are spring-blooming perennial herbaceous bulbiferous geophytes in the Tulipa genus.

The cosine similarity between the two texts above is unexpectedly high at 0.7006508392972203, despite the texts being dissimilar in meaning.
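For reference, this is the standard cosine-similarity computation I'm using (shown here with tiny toy vectors rather than the real 1536-dimensional ada-002 embeddings, which come back from the API):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the
    vector magnitudes. In general it ranges from -1 to 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim vectors standing in for real 1536-dim embeddings.
u = [1.0, 2.0, 3.0]
v = [1.0, 2.0, 3.0]
w = [-3.0, 0.0, 1.0]

print(cosine_similarity(u, v))  # identical direction -> ~1.0
print(cosine_similarity(u, w))  # orthogonal -> 0.0
```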

I’m puzzled by this behavior and would appreciate any insights into why the cosine similarity is consistently high even for non-related texts.

I wonder if I am interpreting the cosine similarity number correctly?

PS:

Please note that I have omitted the code that calculates cosine_similarity (it's very simple), as this inquiry is about the conceptual interpretation of cosine similarity between two embedding vectors.

This is just a known feature of ada-002 embeddings. The newer models will go down to 0, but rarely much below zero. Each model has different CS thresholds.


I always treat cosine similarities from the text-embedding-ada-002 model (1536 dims) as relative: a CS value above a suitable threshold indicates a semantic relationship, and a lower value implies the lack of one.

Hi @curt.kennedy and @sps ,

This is very helpful. Thank you!

Should I normalize similarities from “0.68 to 1” to “0 to 1”, then?
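For concreteness, the kind of remapping I have in mind is a simple linear rescale (treating the ~0.68 floor as an empirical observation, not a model guarantee):

```python
def rescale(cs, floor=0.68, ceil=1.0):
    """Linearly remap a cosine similarity from [floor, ceil] to [0, 1].
    The 0.68 floor is an empirical observation for ada-002, not a
    documented property of the model."""
    return (cs - floor) / (ceil - floor)

print(rescale(0.68))  # -> 0.0
print(rescale(1.0))   # -> 1.0
print(rescale(0.84))  # midpoint of the range -> ~0.5
```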

Any chance you have a reference for more reading on this topic?

Thanks again.

Higher CS value above a specific threshold

What is the threshold, or how do you find a proper one?

It’s been discussed a bunch, here is one example I found:

Some people normalize it. What I do is adjust my thresholds: usually anything above 0.9 is correlated, anything below 0.8 is uncorrelated, and 0.8 to 0.9 is the grey zone.

But these are rough values, and you should adjust from here given your observations on your own data set.


One of the best ways, as @curt.kennedy said, is to simply test embeddings with your data and observe how the CS values correlate with relevance. Then pick a value slightly lower than the lowest CS at which you still observe relevance.
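A minimal sketch of that approach, assuming you have collected a small set of (cosine similarity, was-it-relevant) observations from your own data (the `margin` parameter and the sample numbers below are made up for illustration):

```python
def pick_threshold(observations, margin=0.02):
    """Given (cosine_similarity, is_relevant) pairs judged on your own
    data, return a cutoff slightly below the lowest similarity that
    was still judged relevant."""
    relevant = [cs for cs, is_relevant in observations if is_relevant]
    if not relevant:
        raise ValueError("need at least one relevant observation")
    return min(relevant) - margin

# Hypothetical observations from manual relevance judgments.
obs = [(0.92, True), (0.88, True), (0.81, True), (0.76, False), (0.71, False)]
print(pick_threshold(obs))  # slightly below the lowest relevant CS (0.81)
```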
