I have a dataset with over 80k random text messages and I embedded each of the messages with ‘text-embedding-ada-002’.
When I pick a message at random and find the top 10 messages that are close (+1 dot product), far away (-1 dot product), and orthogonal (0 dot product), all I get are embeddings that are at most 50 degrees away!
The messages range over random spammers and alerts to more common messaging you would expect from millions of people. So I expect to see embeddings that at least have a negative dot product compared to any given embedding chosen at random.
This has me worried that a huge chunk of the embedding hypersphere is used for things other than relatively short chunks of English text. Is it code, maybe? Or languages other than English?
Can anyone give an example using ‘text-embedding-ada-002’ where the angle between two embeddings is even close to 180 degrees? Even 90 degrees would be interesting to me at this point.
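For reference, the search described above (closest, orthogonal, and farthest neighbours of one message) can be sketched as follows. This is a minimal plain-Python version under my own assumptions: the helper names `angle_deg` and `rank_by_angle` are made up, and the 2-D vectors are toy stand-ins, not real ada-002 output.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def rank_by_angle(query, others, target_deg, top=10):
    """(index, angle) for the `top` vectors whose angle to `query` is nearest target_deg."""
    angles = [angle_deg(query, v) for v in others]
    order = sorted(range(len(others)), key=lambda i: abs(angles[i] - target_deg))
    return [(i, angles[i]) for i in order[:top]]

# Toy 2-D stand-ins for embeddings (hypothetical data).
query = [1.0, 0.0]
others = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]

closest = rank_by_angle(query, others, 0, top=1)      # target +1 dot product
orthogonal = rank_by_angle(query, others, 90, top=1)  # target 0 dot product
farthest = rank_by_angle(query, others, 180, top=1)   # target -1 dot product
print(closest, orthogonal, farthest)
```

With real embeddings, the angles returned for the 90 and 180 degree targets show directly how far from those targets the nearest candidates actually are.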
Out of interest, are you using dot product or cosine similarity?
Cosine similarity only cares about angle difference, while dot product cares about angle and magnitude.
Sometimes it is desirable to ignore the magnitude, hence cosine similarity is nice, but if magnitude plays a role, dot product would be better. Neither of them is a “distance metric” though.
In general (not tested on ada-002), cosine similarity tends to work best with longer text, and dot product tends to work best when you have only a few words.
It might be best to do both calculations and come up with a relationship between them that works best in your case.
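To make the dot product versus cosine distinction concrete, here is a small sketch with made-up 2-D vectors (the function names are my own). It shows that the two measures differ when magnitudes differ and coincide once both vectors are scaled to unit length.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Dot product divided by both magnitudes: depends on angle only.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(u):
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]

u = [3.0, 4.0]
v = [6.0, 8.0]  # same direction as u, twice the length

print(dot(u, v))                        # grows with magnitude
print(cosine(u, v))                     # ignores magnitude
print(dot(normalize(u), normalize(v)))  # equals the cosine for unit vectors
```

So if the embedding vectors are already unit length, choosing between the two measures makes no difference.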
Sorry, I missed the comment at the very end of the embedding document that said that the vectors had been normalized. So yes, magnitude is not involved.
OK, to wrap up this topic, I am only able to empirically measure that ‘text-embedding-ada-002’ gives a max angular difference of around 54 degrees. This is out of the possible 180 degrees.
I can also confirm the vectors are not Gaussian, because if they were, their pairwise angles would be concentrated around 90 degrees (see https://arxiv.org/pdf/1306.0256.pdf for the exact distribution).
So, I'm not sure why all the embedding vectors live in a ~54 degree wide cone, but they do. Don’t expect to find embeddings that are orthogonal (near 90 degrees) or opposite (near 180 degrees) when using the ‘ada 002’ embedding engine. The 54 degrees is about as big as it gets.
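The Gaussian comparison is easy to check numerically. The sketch below draws pairs of independent Gaussian vectors with 1536 components (the ada-002 embedding size) and measures the angle between them. It uses only the standard library; the seed and the sample count of 200 are arbitrary choices of mine.

```python
import math
import random

def angle_deg(u, v):
    """Angle in degrees between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def random_gaussian_vector(dim, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random(0)  # fixed seed so runs are repeatable
DIM = 1536              # dimensionality of ada-002 embeddings
angles = [
    angle_deg(random_gaussian_vector(DIM, rng), random_gaussian_vector(DIM, rng))
    for _ in range(200)
]
mean_angle = sum(angles) / len(angles)
min_angle, max_angle = min(angles), max(angles)
print(f"mean {mean_angle:.2f}, min {min_angle:.2f}, max {max_angle:.2f}")
```

In this many dimensions, random directions land within a few degrees of 90, which is nothing like a population of vectors that never gets more than about 54 degrees apart.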
I’m reluctant to revive an old topic (especially since we’ve discussed this elsewhere), but for posterity the following pieces of text have an angular difference of about 63°.
v = [" onBindViewHolder", " segments_remain doseima meaning"]
I might try later to do an embedding of all single token texts and find the maximum angle between them. If you (or anyone else) wanted to crowd-fund doing all ~1E10 two-token embeddings, we could see what the maximum angles are there.
It’s possible that once I have all of the 100k single token embeddings and I identify the most isolated, it might point towards some combination of maximally distant embeddings based on the other experimentation with embedding addition.
Here is a simple script for trying to discover content with a high angle.
It generates two random messages of random token lengths (up to K tokens) and compares their embedding angle.
import os
import math
import requests
import numpy as np
import tiktoken

# Assumes the API key is exported as the OPENAI_API_KEY environment variable.
OPENAI_KEY = os.environ["OPENAI_API_KEY"]
HEADERS = {"Authorization": f"Bearer {OPENAI_KEY}",
           "Content-Type": "application/json"}
enc = tiktoken.get_encoding("cl100k_base")
N = 100263 # max non-special token for cl100k_base ... assuming remaining tokens down to zero are legit
K = 50 # max number of tokens per message synthesized

def embed(text):
    # Request one ada-002 embedding and return it as a numpy array.
    payload = {"model": "text-embedding-ada-002", "input": text}
    r = requests.post("https://api.openai.com/v1/embeddings", json=payload, headers=HEADERS)
    r.raise_for_status()  # fail loudly on HTTP errors instead of on a missing key
    return np.array(r.json()['data'][0]['embedding'])

for ii in range(10): # run 10 trials
    K0 = np.random.randint(low=1, high=K + 1) # message 0 random size (high is exclusive)
    K1 = np.random.randint(low=1, high=K + 1) # message 1 random size
    # .tolist() converts numpy integers to plain Python ints before decoding
    Msg0 = enc.decode(np.random.randint(low=0, high=N, size=K0).tolist())
    Msg1 = enc.decode(np.random.randint(low=0, high=N, size=K1).tolist())
    v0 = embed(Msg0)
    v1 = embed(Msg1)
    c = float(np.dot(v0, v1)) # vectors are normalized, so this is the cosine
    print("##################")
    print(f"Msg0: {Msg0}")
    print(f"Msg1: {Msg1}")
    print(f"Dot Product: {c}")
    # Clip guards against rounding pushing c slightly outside [-1, 1].
    print(f"Angle: {math.degrees(math.acos(np.clip(c, -1.0, 1.0)))}")
One observation is that as the synthesized message lengths increase, the cone tends to be smaller on average. So in the script above, if you force K0 and K1 to be the same high number, your average angle gets smaller (in the limit). There is some sweet spot, though; I haven’t measured it.
I saw that as well. I did random draws of 1000 samples of 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, and 2000 tokens and looked at the angles between all of them within and between groups. The 2,000-token samples had, by far, the smallest average angular difference.
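For anyone repeating the length experiment, the within-group and between-group averages can be computed along these lines. This is only a sketch: the helper names are mine, and the toy 2-D vectors stand in for real embeddings grouped by token count.

```python
import math
from itertools import combinations, product

def angle_deg(u, v):
    """Angle in degrees between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def mean_angle_within(group):
    """Average angle over all unordered pairs inside one group."""
    pairs = list(combinations(group, 2))
    return sum(angle_deg(u, v) for u, v in pairs) / len(pairs)

def mean_angle_between(group_a, group_b):
    """Average angle over all pairs with one vector from each group."""
    pairs = list(product(group_a, group_b))
    return sum(angle_deg(u, v) for u, v in pairs) / len(pairs)

# Hypothetical groups: pretend these are embeddings of short and long samples.
short_msgs = [[1.0, 0.0], [0.0, 1.0]]
long_msgs = [[3.0, 4.0], [4.0, 3.0]]

within_short = mean_angle_within(short_msgs)
within_long = mean_angle_within(long_msgs)
between = mean_angle_between(short_msgs, long_msgs)
print(within_short, within_long, between)
```

With 1000 samples per group the pair counts get large (about 500k pairs within a group), so a matrix product in numpy would be the practical choice at full scale.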
My thought on that is that, when sampling uniformly at random, the interactions drive the model to some sort of plain vanilla region of the vector space. But there could be some combination or subset of tokens, or tokens in some particular order, that land in other areas of the space.
My new, just-thought-of-as-I-type-this, hypothesis is that if one were to do some type of clustering on single-token embeddings, identify the most orthogonal clusters, take large within-cluster samples from each, and compute the between-group cosine similarities, the bias for embeddings of longer token inputs to be more similar would disappear.
But, I’ll need to wait until I can get on my department server so I can take the cross-product of a 1536x100256 matrix and try to cluster that many points.
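A rough sketch of the clustering step in that hypothesis is below. It is not the method anyone in this thread actually ran: I am assuming a simple spherical k-means (cosine-based) with a deterministic farthest-point start, written in plain Python on toy 3-D unit vectors. At the scale of ~100k vectors with 1536 components you would want numpy or a clustering library instead.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def farthest_point_init(points, k):
    """Start from points[0], then repeatedly add the point least similar to all chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(min(points, key=lambda p: max(dot(p, c) for c in centroids)))
    return centroids

def spherical_kmeans(points, k, iters=20):
    """Cluster unit vectors by cosine similarity; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return centroids, labels

def most_separated_pair(centroids):
    """Indices of the two centroids with the largest angle, plus that angle in degrees."""
    pairs = [(i, j) for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
    i, j = min(pairs, key=lambda ij: dot(centroids[ij[0]], centroids[ij[1]]))
    cos = max(-1.0, min(1.0, dot(centroids[i], centroids[j])))
    return i, j, math.degrees(math.acos(cos))

# Toy data: three tight groups of directions (hypothetical, not token embeddings).
raw = [
    [1, 0.05, 0], [1, 0, 0.05], [1, -0.05, 0],   # near the x axis
    [1, 1, 0.05], [1, 0.95, 0], [0.95, 1, 0],    # near the x-y diagonal
    [0, 0.05, 1], [0.05, 0, 1], [0, 0, 1],       # near the z axis
]
points = [normalize(p) for p in raw]
centroids, labels = spherical_kmeans(points, k=3)
i, j, angle = most_separated_pair(centroids)
print(labels, (i, j), round(angle, 2))
```

The two clusters returned by `most_separated_pair` would be the ones to draw the large within-cluster samples from.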
Your 63 degree angle is probably the current Guinness World Record of ada-002 max angles.
One thing I thought of is that if the tokens are “within the same language”, you might get higher angles.
I say this because if you start mixing languages, the meaning is nothing, and so it just ends up in some averaged dead zone of meaning in the vector space, hence the low angle, maybe???
But will certainly be interested if anyone can, at this point, get an angle above 90 degrees.
Anything close to 180 would be absolutely jaw dropping.