Question on text-embedding-ada-002

Yes, ada-002 has poor geometry and is non-isotropic.

I mentioned it initially a while ago over here:

And I came up with a solution that involves PCA (Principal Component Analysis) fitting of the data over here:

Note that ada-002 only needs dot products as a similarity measure (equivalent to cosine similarity, since ada-002 returns unit vectors). If you had a TON of embeddings and needed a speedup in your search, you could drop down to a Manhattan (L1) metric instead, since this only involves additions (subtractions) and absolute values, with no multiplications. The L1 ranking is not guaranteed to match the dot-product ranking exactly, so treat it as an approximation.
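To make the two options concrete, here is a minimal sketch using NumPy. The vectors are random stand-ins rather than real ada-002 embeddings, and the 64-dimensional size, the `normalize` helper, and the planted index 42 are all assumptions made for illustration. It checks that the dot product equals cosine similarity for unit vectors, and then ranks the same corpus by Manhattan distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Scale vectors to unit length along the last axis (hypothetical helper)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in data: random unit vectors, NOT real ada-002 embeddings.
corpus = normalize(rng.standard_normal((1000, 64)))

# Assumption for the demo: the query is a slightly noisy copy of item 42.
query = normalize(corpus[42] + 0.01 * rng.standard_normal(64))

# For unit vectors the dot product already is the cosine similarity.
dot_scores = corpus @ query
cosine_scores = dot_scores / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))

# Manhattan (L1) distance: only subtractions, absolute values, and additions.
manhattan = np.abs(corpus - query).sum(axis=1)

best_by_dot = int(np.argmax(dot_scores))  # highest similarity wins
best_by_l1 = int(np.argmin(manhattan))    # smallest distance wins
```

In this toy setup both metrics pick the planted neighbor, but on real data the two rankings can differ further down the list, so it is worth comparing them on your own embeddings before switching.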

But with a set of 400k embeddings, exhaustive search takes only 1 second using dot products (inner products) and basic cloud functions.
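For reference, exhaustive search is just one matrix-vector product followed by a top-k selection. The sketch below is an illustration, not the exact code behind the timing above: the function name `top_k_dot`, the 5,000-row corpus, and the random data are assumptions, while 1536 is the dimension ada-002 returns.

```python
import numpy as np

def top_k_dot(query, corpus, k=5):
    """Return indices and scores of the k rows of `corpus` with the largest dot product."""
    scores = corpus @ query                    # one dot product per stored embedding
    k = min(k, scores.shape[0])
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top k, avoids a full sort
    idx = idx[np.argsort(-scores[idx])]        # order those k by descending score
    return idx, scores[idx]

# Stand-in corpus: random unit vectors in float32, NOT real embeddings.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((5000, 1536), dtype=np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Query with a stored vector so the expected best match is known.
idx, scores = top_k_dot(corpus[123], corpus, k=5)
```

Keeping the whole corpus in one contiguous float32 array is what makes this fast, since the product is handed to an optimized BLAS routine instead of a Python loop.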
