Cosine distance changing with new embedding models?

I am using PGVector to store chunks of information. I create embeddings for each chunk and use these embeddings to find the most relevant chunks to send to the chat engine (including the system prompt and user question).
This works reasonably well however I noticed that changing to the new embedding model(s) creates a larger cosine distance compared to the previous model. Is/Was this to be expected?

It is expected that cosine similarity (equivalently, the dot product of unit-normalized embeddings) will differ between embedding models, if only because of quality differences, so any cutoff thresholds you have set will need retuning.
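As a quick sanity check, the similarity numbers discussed in this thread can be reproduced directly; a minimal sketch with NumPy (OpenAI embeddings are returned unit-length, so for them the plain dot product gives the same value):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit-normalized embeddings (as the OpenAI models return),
# the plain dot product a @ b alone yields the same number.
```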

However, the difference here is indeed large: with the new models, dissimilar results approach 0, instead of barely dipping below 0.7 as with ada-002.

(The Japanese text 生け花 below means “flower arranging.”)

== 3-large cosine similarity comparisons ==

  • 0:“生け花” <==> 1:“US Presidents” - 0.04354233
  • 0:“生け花” <==> 2:“Ronald Reagan” - 0.03268730
  • 0:“生け花” <==> 3:“George Bush” - 0.08465978
  • 1:“US Presidents” <==> 2:“Ronald Reagan” - 0.45871653
  • 1:“US Presidents” <==> 3:“George Bush” - 0.48322953
  • 2:“Ronald Reagan” <==> 3:“George Bush” - 0.55673759

Examine the individual comparisons: 0.03 for “Reagan” vs. “flower arranging”, but 0.56 when comparing two presidents by name.
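A comparison table like the one above can be produced from precomputed vectors; a minimal sketch (fetching the actual embeddings from the API is omitted, and the labels below are just illustrative):

```python
from itertools import combinations

import numpy as np

def pairwise_similarities(embeddings):
    """All pairwise cosine similarities for a dict of label -> vector."""
    sims = {}
    for (la, va), (lb, vb) in combinations(embeddings.items(), 2):
        va = np.asarray(va, dtype=float)
        vb = np.asarray(vb, dtype=float)
        sims[(la, lb)] = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return sims
```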

== ada-002 cosine similarity comparisons ==

  • 0:“生け花” <==> 1:“US Presidents” - 0.70878423
  • 0:“生け花” <==> 2:“Ronald Reagan” - 0.71841771
  • 0:“生け花” <==> 3:“George Bush” - 0.73647634
  • 1:“US Presidents” <==> 2:“Ronald Reagan” - 0.86212640
  • 1:“US Presidents” <==> 3:“George Bush” - 0.88818758
  • 2:“Ronald Reagan” <==> 3:“George Bush” - 0.87318237

OK, but is this what is to be expected? I get results that don’t make sense to me: some chunks receive a better (cosine) rating than others while containing less relevant information, or none at all, from the viewpoint of the question. That makes it very difficult to predict which text would be relevant to send to the API, and what a good cutoff for cosine distance would be.

You can certainly try all three models and see which performs best for your search task. You can start with a top-5 result set instead of a threshold, and also cap the result count by total tokens if the chunks are going back to an AI model.
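The two ideas can be combined, sketched below; `count_tokens` is a hypothetical stand-in (in practice you might plug in a real tokenizer such as tiktoken), and the parameter names are assumptions:

```python
def select_chunks(scored_chunks, top_k=5, token_budget=3000, count_tokens=len):
    """Pick up to top_k highest-similarity chunks, stopping at a token budget.

    scored_chunks: list of (similarity, text) pairs.
    count_tokens: hypothetical tokenizer hook; len() is only a placeholder.
    """
    picked, used = [], 0
    for sim, text in sorted(scored_chunks, key=lambda p: p[0], reverse=True)[:top_k]:
        cost = count_tokens(text)
        if used + cost > token_budget:
            break  # stop once the budget would be exceeded
        picked.append(text)
        used += cost
    return picked
```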

The embeddings are semantically based, using deeper machine learning than can be articulated. Your results might be affected by dimensions that encode aspects like “is professional language”, “happened in the USA”, “things that fly”…

George Bush is a better result for flower arranging than Reagan? And drastically more so in the new model? (Reagan wins for “thermonuclear war”, though.)