I’m encountering an issue where the cosine similarity between embedding vectors extracted using the text-embedding-ada-002 model consistently yields values above 0.68, even for dissimilar or unrelated texts.
For example, when I calculate the cosine similarity between the following two texts:
Recession is happening in the economy.
Tulips are spring-blooming perennial herbaceous bulbiferous geophytes in the Tulipa genus.
The cosine similarity between the above two texts is unexpectedly high at 0.7006508392972203, despite the texts being dissimilar in meaning.
I’m puzzled by this behavior and would appreciate any insights into why the cosine similarity is consistently high even for unrelated texts.
I also wonder whether I am interpreting the cosine similarity number correctly.
PS:
Please note that I have omitted the code that calculates cosine_similarity (it is very simple), as this inquiry concerns the conceptual understanding of cosine similarity between two embedding vectors.
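For reference, here is a minimal sketch of the kind of calculation I mean, assuming the openai Python client (v1+) and numpy; the helper names are mine, not from any library:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(text: str) -> np.ndarray:
    # text-embedding-ada-002 returns a 1536-dimensional vector
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # dot product divided by the product of L2 norms; ada-002 vectors are
    # already unit-normalized, so this effectively reduces to a dot product
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = embed("Recession is happening in the economy.")
b = embed("Tulips are spring-blooming perennial herbaceous bulbiferous geophytes in the Tulipa genus.")
print(cosine_similarity(a, b))  # ~0.70 in my runs, despite unrelated meanings
```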
I always treat cosine similarity values from text-ada-002 embeddings (1536 dims) as relative: a CS value above a chosen threshold indicates a semantic relationship, and a lower value implies the lack of one.
It’s been discussed a bunch; here is one example I found:
Some people normalize it. What I do is adjust my thresholds. Usually anything above 0.9 is correlated, anything less than 0.8 is uncorrelated, and anything between 0.8 and 0.9 is the grey zone.
But these are rough values, and you should adjust from here given your observations on your own data set.
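For illustration, those rough cutoffs could be applied like this; the function name and the bucket labels are just a sketch of the rule of thumb above, not a fixed recipe:

```python
def classify_pair(cs: float, low: float = 0.8, high: float = 0.9) -> str:
    # Buckets a cosine similarity score using the rough thresholds quoted above.
    if cs >= high:
        return "correlated"
    if cs < low:
        return "uncorrelated"
    return "grey zone"


print(classify_pair(0.7006508392972203))  # -> "uncorrelated" under these cutoffs
```

Note that under these cutoffs the 0.70 score from the original question already lands in the "uncorrelated" bucket.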
One of the best ways, as @curt.kennedy said, is to simply test embeddings with your data and observe how the CS values correlate with relevance. Then pick a value slightly lower than the lowest CS at which you observe relevance.
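A minimal sketch of that calibration step, assuming you already have (CS, relevance) pairs labeled from your own data; the helper name and the margin value are hypothetical:

```python
def pick_threshold(scored_pairs: list[tuple[float, bool]], margin: float = 0.01) -> float:
    # scored_pairs: (cosine_similarity, is_relevant) for labeled examples.
    relevant_scores = [cs for cs, relevant in scored_pairs if relevant]
    # Threshold slightly below the lowest CS observed for a relevant pair.
    return min(relevant_scores) - margin


# Toy labeled data: two relevant pairs, two irrelevant ones.
pairs = [(0.93, True), (0.88, True), (0.71, False), (0.69, False)]
print(pick_threshold(pairs))  # 0.87 for this toy data
```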