Cosine distance changing with new embedding models?

I am using PGVector to store chunks of information. I create embeddings for each chunk and use these embeddings to find the most relevant chunks to send to the chat engine (including the system prompt and user question).
This works reasonably well however I noticed that changing to the new embedding model(s) creates a larger cosine distance compared to the previous model. Is/Was this to be expected?

It is expected that cosine similarity (equivalently, the dot product of unit-normalized embeddings) will differ between embedding models, if only because of quality differences, so any cutoff thresholds you have set will need retuning.
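As a quick sanity check, the similarity numbers discussed in this thread can be reproduced directly; a minimal sketch with NumPy (OpenAI embeddings are returned unit-length, so for them the plain dot product gives the same value):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit-normalized embeddings (as the OpenAI models return),
# the plain dot product a @ b alone yields the same number.
```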

However, the difference here is indeed large: with the new models, dissimilar results approach 0, instead of barely dipping below 0.7 as with ada-002.

(The Japanese text 生け花 below means “flower arranging.”)

== 3-large cosine similarity comparisons ==

  • 0:“生け花” <==> 1:“US Presidents” - 0.04354233
  • 0:“生け花” <==> 2:“Ronald Reagan” - 0.03268730
  • 0:“生け花” <==> 3:“George Bush” - 0.08465978
  • 1:“US Presidents” <==> 2:“Ronald Reagan” - 0.45871653
  • 1:“US Presidents” <==> 3:“George Bush” - 0.48322953
  • 2:“Ronald Reagan” <==> 3:“George Bush” - 0.55673759

Examine the individual comparisons: 0.03 for “Reagan” vs. “flower arranging”, but 0.56 when comparing two presidents by name.
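A comparison table like the one above can be produced from precomputed vectors; a minimal sketch (fetching the actual embeddings from the API is omitted, and the labels below are just illustrative):

```python
from itertools import combinations

import numpy as np

def pairwise_similarities(embeddings):
    """All pairwise cosine similarities for a dict of label -> vector."""
    sims = {}
    for (la, va), (lb, vb) in combinations(embeddings.items(), 2):
        va = np.asarray(va, dtype=float)
        vb = np.asarray(vb, dtype=float)
        sims[(la, lb)] = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return sims
```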

== ada-002 cosine similarity comparisons ==

  • 0:“生け花” <==> 1:“US Presidents” - 0.70878423
  • 0:“生け花” <==> 2:“Ronald Reagan” - 0.71841771
  • 0:“生け花” <==> 3:“George Bush” - 0.73647634
  • 1:“US Presidents” <==> 2:“Ronald Reagan” - 0.86212640
  • 1:“US Presidents” <==> 3:“George Bush” - 0.88818758
  • 2:“Ronald Reagan” <==> 3:“George Bush” - 0.87318237

OK, but is this what is to be expected? I get results that don’t make sense to me: some chunks receive a better (cosine) rating than others while containing less relevant information, or none at all, from the viewpoint of the question. That makes it very difficult to predict which text would be relevant to send to the API, and what a good cutoff for cosine distance would be.

You can certainly try all three models and see which performs best for your search task. You can start with a top-5 result set instead of a threshold, and also cap the result count by total tokens if the chunks are going back to an AI model.
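The two ideas can be combined, sketched below; `count_tokens` is a hypothetical stand-in (in practice you might plug in a real tokenizer such as tiktoken), and the parameter names are assumptions:

```python
def select_chunks(scored_chunks, top_k=5, token_budget=3000, count_tokens=len):
    """Pick up to top_k highest-similarity chunks, stopping at a token budget.

    scored_chunks: list of (similarity, text) pairs.
    count_tokens: hypothetical tokenizer hook; len() is only a placeholder.
    """
    picked, used = [], 0
    for sim, text in sorted(scored_chunks, key=lambda p: p[0], reverse=True)[:top_k]:
        cost = count_tokens(text)
        if used + cost > token_budget:
            break  # stop once the budget would be exceeded
        picked.append(text)
        used += cost
    return picked
```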

The embeddings are semantically based, using deeper machine learning than can be articulated. Your results might be affected by dimensions that encode aspects like “is professional language”, “happened in the USA”, “things that fly”…

George Bush is a better result for flower arranging than Reagan? And drastically more so in the new model? (Reagan wins for “thermonuclear war”, though.)