Embedding Results Scale Seems Off

Hello, community.
I am trying to compare Add 002’s embeddings against other common vectorizers, and while Ada 002 does seem to be better than the other metrics against which I’ve tested it (e.g., TF-IDF, Euclidean Distance, Jaccard Distance, and various BERT embedding models), the scale seems odd to me.
I used the first chapter of Mark from various translations of the Bible (since they all say the same thing in different ways), along with the first part of the Constitution (same length).

The cosine similarity between the translations’ embeddings was really high, 0.90 - 0.99, while the similarity against the Constitution was 0.73 to 0.78. Compare that to the BERT models, where the Constitution’s similarity values were 0.06 - 0.23.

I’m just curious why the scale for the similarity values seems so much smaller for GPT’s Ada compared to BERT. That seems to water down the predictive value of the model if all the embeddings are so close.

As an aside, I saw one post here in the community that said they generated embeddings from a fine-tuned model. Is that possible? Is there any official documentation about doing that?

If you are seeing values of +/- 0.1, then you’re not renormalising back to unit vectors. See this post: Some questions about text-embedding-ada-002’s embedding - #41 by curt.kennedy
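As a minimal sketch of what I mean by renormalising (my own toy code, not from that post), once a vector has unit length, the plain dot product *is* the cosine similarity:

```python
import numpy as np

def to_unit(v):
    """Rescale a vector to unit length so a plain dot product equals cosine similarity."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = to_unit([3.0, 4.0])
b = to_unit([4.0, 3.0])
cos_sim = float(np.dot(a, b))  # 0.96 for these toy vectors
```

If you skip this step on vectors that aren’t unit length, the dot products can land in an arbitrary range (like +/- 0.1) that looks nothing like a cosine.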

The renormalizing back to unit vectors was in the context of my trying to “fix” the fact that ada-002 isn’t isotropic. In practice this means that no matter what data you put into ada-002, your cosine similarity won’t go much below 0.7.

So in that post referenced above, I used PCA to make the embeddings from ada-002 more isotropic, so that the cosine similarity will go below 0.7, and even negative!

The problem is that this requires a post-fit (batch processing): you need a pile of embeddings already sitting there to fit the transform on. Follow-on embeddings then get transformed by the same fit, and your resulting embeddings come out more isotropic, i.e. more varied around the unit hypersphere (much more cosine similarity diversity).
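A rough sketch of that post-fit idea (my own minimal NumPy version, not the exact code from the linked post; the function names and the choice of dropping the top two components are just illustrative): fit a mean and the dominant PCA directions on a batch, then for each embedding subtract the mean, project out those directions, and renormalize.

```python
import numpy as np

def fit_isotropy_transform(E, drop_top=2):
    """Fit on a batch of embeddings (rows of E): mean-center, then use PCA
    to find the dominant shared directions that make the space anisotropic."""
    mu = E.mean(axis=0)
    X = E - mu
    # SVD gives the principal directions without forming a covariance matrix
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    top = Vt[:drop_top]              # components to project out
    return mu, top

def apply_isotropy_transform(v, mu, top):
    """Transform a (possibly follow-on) embedding with the fitted parameters,
    then renormalize back to a unit vector."""
    x = v - mu
    x = x - top.T @ (top @ x)        # remove the dominant shared components
    return x / np.linalg.norm(x)

# Toy demo: random "embeddings" with a shared offset, so raw cosines are all near 1
rng = np.random.default_rng(0)
E = rng.normal(size=(100, 8)) + 5.0
mu, top = fit_isotropy_transform(E)
u = apply_isotropy_transform(E[0], mu, top)
w = apply_isotropy_transform(E[1], mu, top)
sim = float(u @ w)  # can now fall well below 0.7, even go negative
```

The key operational point is that `mu` and `top` come from a batch fit, so any new embedding must be pushed through the same `apply_isotropy_transform` to stay comparable.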


Ah, OK. Looks like I got the wrong end of the stick then. But maybe found the right person to help :smile:

I am also seeing cosine similarity values consistently in the 0.7 range and was shocked at the contrast with the BERT models. I also tried to sanity-check by comparing my start term ‘politics’ to random letters, and the similarity was still 0.77.

The following strings had these cosine similarities with ‘politics’
0.7737626287099602 fezpoof
0.7585620932449633 Curry hour
0.7610159835788786 Whose Really Supporting Russia
0.7911806266053704 The Perfect Hillary Clinton Analogy
0.7849508209736753 The Evolution of Alex Jones
0.7476810271551438 Patrick Bet David on The Breakfast Club
0.8136755893151938 The Truth About The 2020 Election
0.7559975196314659 Kobe Bryant’s Last Great Interview


Maybe this is a pedantic question, but if we agree that the true range for the cosine similarities based on Ada’s embedding vectors is 0.7 - 1.0, why can’t we just normalize the range whereby
0.70 = 0.0,
0.85 = 0.5, and
1.00 = 1.0?

It’s not perfect, but it makes comparison with BERT much easier.
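That mapping is just a linear rescale of the observed band onto [0, 1]. A one-liner sketch (assuming the 0.7 floor holds for your data; `rescale` is my own name):

```python
def rescale(sim, lo=0.7, hi=1.0):
    """Linearly map the observed [0.7, 1.0] band onto [0, 1]:
    0.70 -> 0.0, 0.85 -> 0.5, 1.00 -> 1.0."""
    return (sim - lo) / (hi - lo)
```

Note that anything that dips below your assumed floor comes out negative, which may or may not be what you want.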

It could be the normalization technique. If the embeddings are normalized, there might be a disparity between negative and positive similarities, with negative being less common; if so, you end up with a restricted range of values. You could apply a linear transform and normalise them yourself, I guess.

The minimum cosine similarity of 0.7 from ada-002 is a “feature”. :face_with_monocle:

You’d think it should have a minimum of -1, right? Well, it doesn’t. The model isn’t isotropic, and all correlations fall between 0.7 and 1 (instead of ranging from -1 to 1).

You could batch process a set of embeddings, and transform them to process out this feature using PCA, but it could be more work than is necessary.

Instead, what I do is calibrate my correlations, similar to a “0.7 :arrow_forward: -1” mapping.

Also tighten the limits of “closeness”. So instead of thinking everything within 0.1 is close, think more like 0.01 or even 0.001.

More info on solutions and motivation:

Paper here:

And implemented it a few posts later here:

After post-processing ~60k embeddings, I was finally getting zero and negative dot products. The values “made sense” empirically, but there was no way for me to fully validate all the results. Give it a shot, though, especially when you care about geometry, e.g. for analogy searches (at the word or sentence level), like so:
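For the analogy-search idea, here’s a generic sketch of the classic vector-arithmetic approach (the function and the toy vocabulary are mine, purely illustrative; real usage would run this over unit-normalized, post-processed embeddings):

```python
import numpy as np

def analogy(a, b, c, vocab):
    """Classic 'a is to b as c is to ?' search: the query direction is
    b - a + c, and the answer is the vocab vector with the highest cosine."""
    q = b - a + c
    q = q / np.linalg.norm(q)
    best, best_sim = None, -2.0
    for word, v in vocab.items():
        sim = float(q @ (v / np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best, best_sim

# Contrived 3-d vectors standing in for embeddings
vocab = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}
best, best_sim = analogy(vocab["man"], vocab["king"], vocab["woman"], vocab)
# best == "queen" in this contrived setup
```

This only works well when the geometry is meaningful, which is exactly why the isotropy post-processing matters: with everything crammed into a 0.7–1.0 cone, the direction arithmetic carries much less signal.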