Embedding Results Scale Seems Off

The minimum cosine similarity of 0.7 from ada-002 is a “feature”. :face_with_monocle:

You’d think the minimum should be -1, right? Well, it isn’t. The embedding space isn’t isotropic, so in practice all the cosine similarities land between roughly 0.7 and 1 (instead of ranging from -1 to 1).

You could batch-process a set of embeddings and use PCA to transform this feature out of them, but that could be more work than is necessary.
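For anyone who does want to try the PCA route, here is a rough sketch of one common way to do it: subtract the mean embedding, project out the top few principal directions, and re-normalize. This is an assumption about the general approach, not necessarily the exact procedure from the paper or the linked post; the function names and the `n_components` default are made up for illustration.

```python
import numpy as np

def fit_postprocess(embeddings, n_components=1):
    """Learn the mean vector and top principal directions from a batch."""
    X = np.asarray(embeddings, dtype=np.float64)
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered from largest to smallest singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_postprocess(embeddings, mean, components):
    """Center, remove the dominant directions, and re-normalize to unit length."""
    X = np.asarray(embeddings, dtype=np.float64) - mean
    X = X - (X @ components.T) @ components  # project out the top directions
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)   # guard against zero vectors
```

Fit once on a large, representative batch, then apply the same `mean` and `components` to every new embedding (including queries) so they all live in the same transformed space.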

Instead, what I do is calibrate my correlations with something like a “0.7 :arrow_forward: -1” mapping, so the observed floor is treated as the bottom of the scale.
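A minimal sketch of that kind of calibration, assuming a plain linear stretch where the observed floor maps to -1 and 1 stays at 1. The `floor=0.7` default is just the rough number from this thread, and `rescale_similarity` is a hypothetical helper name; measure the floor on your own data.

```python
import numpy as np

def rescale_similarity(sim, floor=0.7):
    """Linearly map [floor, 1] onto [-1, 1]; anything below floor is clipped to -1."""
    sim = np.asarray(sim, dtype=np.float64)
    scaled = 2.0 * (sim - floor) / (1.0 - floor) - 1.0
    return np.clip(scaled, -1.0, 1.0)
```

With this mapping a raw gap of 0.01 becomes about 0.067 on the rescaled axis, which is why small differences in the raw scores matter more than they look. Note that this only relabels the scores; the ranking of results doesn't change.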

Also tighten your limits of “closeness”: instead of treating everything within 0.1 as close, think more like 0.01 or even 0.001.

More info on solutions and motivation:

Paper here:

I implemented it a few posts later here:

After post-processing ~60k embeddings, I was finally getting zero and negative dot products. The values “made sense” empirically, but I had no way to fully validate all the results. Still, give it a shot, especially when you care about geometry, for example to do analogy searches (at the word or sentence level), like so:
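For what it’s worth, an analogy search over vectors might look roughly like the toy sketch below: build the target `b - a + c` and rank candidates by cosine similarity to it. This is an illustration under my own assumptions (the function name and arguments are hypothetical), not the code from the linked post.

```python
import numpy as np

def analogy_search(a, b, c, candidates, top_k=3):
    """Rank candidates by cosine similarity to (b - a + c), i.e. "a is to b as c is to ?"."""
    target = np.asarray(b, dtype=np.float64) - np.asarray(a, dtype=np.float64) \
        + np.asarray(c, dtype=np.float64)
    target = target / np.linalg.norm(target)
    cands = np.asarray(candidates, dtype=np.float64)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = cands @ target                # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]    # best matches first
    return order, scores[order]
```

You would pass in embeddings for `a`, `b`, and `c` plus a matrix of candidate embeddings, and usually want to exclude the three inputs themselves from the candidates.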
