This has been discussed extensively before: the embedding vectors from ada-002 get squished together into a narrow cone, so you get high correlations no matter what you compare. You can either adjust your expectations for what a "high" correlation means, or batch post-process the vectors with PCA to make them more isotropic (spread out) going forward.
See …
Hey @ruby_coder @debreuil
Here is the code I wrote to do this. Hope it helps.
import numpy as np
import sklearn.decomposition
import pickle
import time
# Apply 'Algorithm 1' to the ada-002 embeddings to make them isotropic, taken from the paper:
# ALL-BUT-THE-TOP: SIMPLE AND EFFECTIVE POST-PROCESSING FOR WORD REPRESENTATIONS
# Jiaqi Mu, Pramod Viswanath
# This uses Principal Component Analysis (PCA) to 'evenly distribute' the embedding vectors (make them isotropic)
# For more information o…
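Since the snippet is cut off above, here is a minimal sketch of what Algorithm 1 from that paper boils down to: subtract the mean embedding, fit PCA, and remove each vector's projection onto the top few principal components. The function name make_isotropic, the default n_components, and the optional re-normalization at the end are illustrative choices on my part, not necessarily what the full script does.

import numpy as np
import sklearn.decomposition

def make_isotropic(embeddings: np.ndarray, n_components: int = 15) -> np.ndarray:
    # embeddings: (n_samples, n_dims) array of ada-002 vectors
    # n_components: number of dominant directions to remove (D in the paper)

    # 1. Center the cloud of embeddings at the origin
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean

    # 2. Fit PCA to find the dominant directions of the centered cloud
    pca = sklearn.decomposition.PCA(n_components=n_components)
    pca.fit(centered)

    # 3. Subtract each vector's projection onto those top directions
    projections = centered @ pca.components_.T           # (n_samples, n_components)
    isotropic = centered - projections @ pca.components_

    # 4. (Optional) re-normalize so dot products act like cosine similarities again
    isotropic /= np.linalg.norm(isotropic, axis=1, keepdims=True)
    return isotropic

After this post-processing, comparisons between the vectors should spread out over a much wider range of angles than the raw ada-002 dot products do.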
I have a dataset with over 80k random text messages and I embedded each of the messages with ‘text-embedding-ada-002’.
When I pick a message at random and look for the top 10 messages that are close (dot product near +1), far away (near -1) and orthogonal (near 0), all I get are embeddings that are at most 50 degrees away!
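Roughly, the check I'm doing looks like the sketch below (the names embs, angle_stats, and k are just illustrative; ada-002 vectors come back essentially unit-length, so the dot product is the cosine of the angle):

import numpy as np

def angle_stats(embs: np.ndarray, query_idx: int, k: int = 10):
    # embs: (n_messages, 1536) array of ada-002 embeddings, assumed unit-normalized
    sims = embs @ embs[query_idx]               # cosine similarity to every message
    sims = np.clip(sims, -1.0, 1.0)             # guard against rounding outside [-1, 1]
    angles = np.degrees(np.arccos(sims))        # angle to every message, in degrees

    closest = np.argsort(-sims)[1:k + 1]        # top-k most similar (skipping the query itself)
    farthest = np.argsort(sims)[:k]             # top-k least similar
    orthogonal = np.argsort(np.abs(sims))[:k]   # top-k nearest to a 0 dot product

    return angles[closest], angles[farthest], angles[orthogonal]

Even the "farthest" and "orthogonal" buckets come back at 50 degrees or less.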
The messages range from random spam and alerts to the more common messaging you would expect from millions of people. So I expect to see embeddings that at least have a negati…