I have a dataset with over 80k random text messages and I embedded each of the messages with ‘text-embedding-ada-002’.
When I pick a message at random and search for the top 10 messages that are closest (dot product near +1), farthest (dot product near −1), and orthogonal (dot product near 0), all I find are embeddings that are at most about 50 degrees away!
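For reference, here is how I am measuring the angle: dot product of the two vectors, divided by their norms, then arccos. (The toy vectors below are hypothetical, not real ada-002 embeddings; ada-002 outputs are reported to be unit-length, but I normalize defensively anyway.)

```python
import numpy as np

def angle_deg(a, b):
    """Angle in degrees between two vectors via cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against floating-point values slightly outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Toy sanity checks (not real embeddings):
print(angle_deg([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 90.0
print(angle_deg([1.0, 0.0], [-1.0, 0.0]))  # opposite   -> 180.0
```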
The messages range from random spam and automated alerts to the more everyday messaging you would expect from millions of people. So I would expect at least some embeddings to have a negative dot product relative to any given embedding chosen at random.
This has me worried that a huge chunk of the embedding hypersphere is reserved for things other than relatively short chunks of English text. Is it code, maybe? Or languages other than English?
Can anyone give an example using ‘text-embedding-ada-002’ where the angle between two embeddings is even close to 180 degrees? Even 90 degrees would be interesting to me at this point.
OK, to wrap up this topic: empirically, the largest angular difference I can measure between any two ‘text-embedding-ada-002’ embeddings is around 54 degrees, out of a possible 180 degrees.
I can also confirm the vectors are not Gaussian: if they were, their pairwise angles would be concentrated around 90 degrees (see https://arxiv.org/pdf/1306.0256.pdf for the exact distribution).
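A quick simulation illustrates that claim. In high dimensions, random Gaussian vectors are nearly orthogonal: the mean pairwise angle sits at 90 degrees with a small spread. This sketch uses ada-002's 1536 dimensions (the specific sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1536   # ada-002 embedding dimensionality
n = 2000   # number of random vectors to draw

# Random Gaussian directions, normalized to the unit hypersphere
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Angles between disjoint random pairs
cos = np.einsum("ij,ij->i", X[: n // 2], X[n // 2 :])
angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(angles.mean())  # very close to 90 degrees
print(angles.std())   # small spread, roughly degrees(1/sqrt(d)) ~ 1.5
```

If ada-002 embeddings were isotropic like this, a 54-degree maximum separation over 80k messages would be essentially impossible; instead they clearly occupy a narrow cone.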
So, I am not sure why all the embedding vectors live in a roughly 54-degree-wide cone, but they do. Don’t expect to find embeddings that are orthogonal (near 90 degrees) or opposite (near 180 degrees) when using the ‘ada 002’ embedding engine; about 54 degrees is as big as it gets.