I have a dataset with over 80k random text messages and I embedded each of the messages with ‘text-embedding-ada-002’.
When I pick a message at random and look for the top 10 messages that are close (dot product near +1), far away (near -1), and orthogonal (near 0), all I get are embeddings that are at most about 50 degrees away!
The messages range from random spam and alerts to the more everyday messaging you would expect from millions of people, so I expected to see at least some embeddings with a negative dot product relative to any given embedding chosen at random.
This has me worried that a huge chunk of the embedding hypersphere is used for things other than relatively short chunks of English text. Is it code, maybe? Or languages other than English?
Can anyone give an example using ‘text-embedding-ada-002’ where the angle between two embeddings is even close to 180 degrees? Even 90 degrees would be interesting to me at this point.
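For reference, here is roughly how I am measuring the angle between a pair of embeddings. A minimal sketch, assuming the pre-v1 `openai` Python client (`openai.Embedding.create`) with `OPENAI_API_KEY` set in the environment; the two messages are just placeholders:

```python
# Minimal sketch: measure the angle between two ada-002 embeddings.
# Assumes the pre-v1 `openai` Python client and OPENAI_API_KEY in the environment.
import numpy as np
import openai

texts = [
    "I love this, see you at dinner tonight!",              # placeholder message
    "URGENT: your account has been suspended, click here",  # placeholder message
]

resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
a, b = (np.array(d["embedding"]) for d in resp["data"])

# ada-002 vectors are unit length, so the dot product is already the cosine.
cos = float(np.dot(a, b))
angle_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
print(f"cosine = {cos:.4f}, angle = {angle_deg:.1f} degrees")
```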
Out of interest, are you using dot product or cosine similarity?
Cosine similarity only cares about angle difference, while dot product cares about angle and magnitude.
Sometimes it is desirable to ignore the magnitude, hence cosine similarity is nice, but if magnitude plays a role, dot product would be better. Neither of them is a “distance metric” though.
In general AI (not tested on ada-002), cosine works best with longer text and dot product works best when you have only a few words.
It might be best to do both calculations and see which one works best for your case.
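Something like this shows the two numbers side by side (plain NumPy; the vectors here are made-up examples, not real embeddings):

```python
# Compare dot product and cosine similarity on the same pair of vectors.
# The vectors are made-up examples; real embeddings would come from the API.
import numpy as np

a = np.array([0.3, -1.2, 0.8, 2.0])
b = np.array([0.1, -0.4, 0.5, 1.1])

dot = float(np.dot(a, b))
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"dot product       = {dot:.4f}")  # depends on angle *and* magnitudes
print(f"cosine similarity = {cos:.4f}")  # depends on angle only
# If both vectors were unit length, the two numbers would be identical.
```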
Sorry, I missed the note at the very end of the embeddings documentation saying that the vectors are normalized. So yes, magnitude is not involved.
OK, to wrap up this topic: empirically, ‘text-embedding-ada-002’ gives a maximum angular difference of around 54 degrees, out of a possible 180 degrees.
I can also confirm the vectors are not Gaussian, because if they were, their angular distribution would be concentrated around 90 degrees (see https://arxiv.org/pdf/1306.0256.pdf for the exact distribution).
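Here is a quick sanity check of that baseline: pairwise angles between random Gaussian vectors in 1536 dimensions (ada-002's size) really do cluster tightly around 90 degrees. A short NumPy simulation, not real embeddings:

```python
# Pairwise angles between random Gaussian vectors in 1536 dimensions
# (ada-002's dimensionality) concentrate tightly around 90 degrees.
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal((1000, 1536))
v /= np.linalg.norm(v, axis=1, keepdims=True)  # project onto the unit sphere

cos = v @ v.T                                  # all pairwise cosines
upper = cos[np.triu_indices(1000, k=1)]        # distinct pairs only
angles = np.degrees(np.arccos(np.clip(upper, -1.0, 1.0)))
print(f"mean angle = {angles.mean():.1f} deg, std = {angles.std():.2f} deg")
# Prints roughly 90 degrees with a spread of a degree or two --
# nothing like the ~54 degree cone ada-002 produces.
```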
So, I'm not sure why all the embedding vectors live in a ~54-degree-wide cone, but they do. Don't expect to find embeddings that are orthogonal (near 90 degrees) or opposite (near 180 degrees) when using the ‘ada 002’ embedding engine. The 54 degrees is about as big as it gets.
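For anyone who wants to reproduce the ~54 degree number on their own data, this is roughly the measurement. A sketch assuming `embeddings` is an (N, 1536) NumPy array of ada-002 vectors you have already fetched:

```python
# Sketch: find the maximum pairwise angle in a set of ada-002 embeddings.
# Assumes `embeddings` is an (N, 1536) NumPy array of already-fetched vectors.
import numpy as np

def max_pairwise_angle(embeddings: np.ndarray) -> float:
    # ada-002 vectors are already unit length, but renormalize defensively.
    v = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = v @ v.T
    np.fill_diagonal(cos, 1.0)       # ignore self-comparisons
    min_cos = float(cos.min())       # smallest cosine = largest angle
    return float(np.degrees(np.arccos(np.clip(min_cos, -1.0, 1.0))))

# For very large N, compute the cosine matrix in row blocks or run on a random sample.
# On my message set this tops out around 54 degrees -- nowhere near 90 or 180.
```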