Expected Angular Differences in Embedding Random Text?

curt.kennedy · December 31, 2022, 10:19pm

I have a dataset with over 80k random text messages and I embedded each of the messages with ‘text-embedding-ada-002’.

When I pick a message at random and find the top 10 messages that are close (dot product near +1), far away (dot product near -1), and orthogonal (dot product near 0), all I get are embeddings that are at most 50 degrees away!
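A minimal sketch of that lookup, assuming the embeddings are stacked as unit-length rows of a NumPy array. The function name `closest_farthest_orthogonal` and the random stand-in data are illustrative only and not from the original post; real ada-002 vectors would take the place of `E`.

```python
import numpy as np

def closest_farthest_orthogonal(E, query_idx, k=10):
    """Rank rows of E against row `query_idx` by dot product.

    Rows are assumed to be unit length, so the dot product is the cosine.
    Returns index arrays (closest, farthest, most orthogonal) and all angles in degrees.
    """
    dots = E @ E[query_idx]
    angles = np.degrees(np.arccos(np.clip(dots, -1.0, 1.0)))

    hi = dots.copy()
    hi[query_idx] = -np.inf              # keep the query out of its own "closest" list
    closest = np.argsort(-hi)[:k]        # largest dot products first

    lo = dots.copy()
    lo[query_idx] = np.inf               # keep the query out of the other two lists
    farthest = np.argsort(lo)[:k]        # smallest (most negative) dot products first
    orthogonal = np.argsort(np.abs(lo))[:k]  # dot products nearest zero first
    return closest, farthest, orthogonal, angles

# Stand-in data: random unit vectors, NOT real embeddings.
rng = np.random.default_rng(0)
E = rng.standard_normal((1000, 64))
E /= np.linalg.norm(E, axis=1, keepdims=True)

closest, farthest, orthogonal, angles = closest_farthest_orthogonal(E, query_idx=0)
print("largest angle to the query:", angles[farthest[0]])
```

With real embeddings, `angles[farthest[0]]` is the number being reported in this thread: the widest angle found from the chosen message.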

The messages range over random spammers and alerts to more common messaging you would expect from millions of people. So I expect to see embeddings that at least have a negative dot product compared to any given embedding chosen at random.

This has me worried that there is a huge chunk of the embedding hypersphere used for things other than relatively short chunks of English text. Is it code maybe? Or languages other than English?

Can anyone give an example using ‘text-embedding-ada-002’ where the angle between two embeddings is even close to 180 degrees, even 90 degrees at this point would be interesting to me.

raymonddavey · December 31, 2022, 10:46pm

Out of interest, are you using dot product or cosine similarity?

Cosine similarity only cares about angle difference, while dot product cares about angle and magnitude.

Sometimes it is desirable to ignore the magnitude, hence cosine similarity is nice, but if magnitude plays a role, dot product would be better. Neither of them is a “distance metric” though.
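A small sketch of the difference, using only the standard library. The helper names and the toy 2-D vectors are made up for illustration; the point is that scaling a vector changes the dot product but not the cosine, and that the two agree once both vectors have unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by both magnitudes: depends on direction only.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(a):
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print("dot product:      ", dot(a, b))                # grows with magnitude
print("cosine similarity:", cosine_similarity(a, b))  # unaffected by magnitude

# For unit-length vectors the dot product and the cosine coincide.
u, v = normalize(a), normalize(b)
print("dot of unit vectors:", dot(u, v))
```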

In general AI (not tested on ada-002), cosine works best with longer text and dot product works best when you have only a few words.

It might be best to do both calculations and come up with a relationship between them that works best in your case.

curt.kennedy · January 1, 2023, 12:36am

The vectors coming out of ‘text-embedding-ada-002’ are all unit vectors. So magnitude doesn’t play a role.

raymonddavey · January 1, 2023, 1:00am

Sorry, I missed the comment at the very end of the embedding document that said that the vectors had been normalized. So yes, magnitude is not involved.

Next time I’ll read ALL the documentation.

curt.kennedy · January 15, 2023, 10:19pm

OK, to wrap up this topic, I am only able to empirically measure that ‘text-embedding-ada-002’ gives a max angular difference of around 54 degrees. This is out of the possible 180 degrees.

I can also confirm the vectors are not Gaussian, because if they were, their pairwise angles would be distributed tightly around 90 degrees (see https://arxiv.org/pdf/1306.0256.pdf for the exact distribution).
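As a quick illustration of that baseline, here is a sketch that draws independent Gaussian vectors in 1536 dimensions (the ada-002 embedding size) and measures the angles between random pairs. This is a simulation of the hypothetical Gaussian case only; it uses no real embeddings, and the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1536        # ada-002 embedding dimension
n_pairs = 2000  # arbitrary number of random pairs

x = rng.standard_normal((n_pairs, d))
y = rng.standard_normal((n_pairs, d))

# Cosine of the angle between each pair of rows.
cos = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(f"mean angle: {angles.mean():.2f} deg")
print(f"std dev:    {angles.std():.2f} deg")
print(f"min / max:  {angles.min():.2f} / {angles.max():.2f} deg")
```

In this setting the cosine has variance 1/d, so the angles spread by only about 1/sqrt(d) radians (roughly 1.5 degrees for d = 1536) around 90 degrees. A maximum of about 54 degrees is nowhere near that picture.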

So, not sure why all the embedding vectors live in a ~54 degree wide cone, but they do. So don’t expect to find embeddings that are orthogonal (near 90 degrees) or opposite (near 180 degrees) when using the ‘ada 002’ embedding engine. The 54 degrees is about as big as it gets.

anon22939549 · August 17, 2023, 2:56am

I’m reluctant to revive an old topic (especially since we’ve discussed this elsewhere), but for posterity the following pieces of text have an angular difference of about 63°.

v = [" onBindViewHolder",  " segments_remain doseima meaning"]

I might try later to do an embedding of all single token texts and find the maximum angle between them. If you (or anyone else) wanted to crowd-fund doing all ~1E10 two-token embeddings, we could see what the maximum angles are there.

It’s possible that once I have all of the 100k single token embeddings and I identify the most isolated, it might point towards some combination of maximally distant embeddings based on the other experimentation with embedding addition.

curt.kennedy · August 17, 2023, 3:49am

Here is a simple script for trying to discover content with a high angle.

This script generates two random messages of random token lengths (up to K tokens) and compares their embedding angle.

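import math
import os

import numpy as np
import requests
import tiktoken

# Assumes the API key is exported as the OPENAI_API_KEY environment variable.
OPENAI_KEY = os.environ["OPENAI_API_KEY"]

HEADERS = {"Authorization": f"Bearer {OPENAI_KEY}",
           "Content-Type": "application/json"}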

enc = tiktoken.get_encoding("cl100k_base")

N = 100256 # exclusive upper bound for randint: cl100k_base ordinary token IDs are 0..100255 (higher IDs are special or unassigned)
K = 50 # max number of tokens per message synthesized

for ii in range(0,10): # run 10 trials

    K0 = int(np.random.randint(low=1, high=K + 1)) # message 0 random size, 1..K (high is exclusive)
    K1 = int(np.random.randint(low=1, high=K + 1)) # message 1 random size, 1..K

    # Draw random token IDs and decode them to text; .tolist() gives plain Python ints.
    Msg0 = enc.decode(np.random.randint(low=0, high=N, size=K0).tolist())
    Msg1 = enc.decode(np.random.randint(low=0, high=N, size=K1).tolist())


    Payload0 = {
    "model": "text-embedding-ada-002",
    "input": Msg0
    }

    Payload1 = {
    "model": "text-embedding-ada-002",
    "input": Msg1
    }

    r = requests.post("https://api.openai.com/v1/embeddings",json=Payload0,headers=HEADERS)
    q = r.json()
    Embedding0 = q['data'][0]['embedding']

    r = requests.post("https://api.openai.com/v1/embeddings",json=Payload1,headers=HEADERS)
    q = r.json()
    Embedding1 = q['data'][0]['embedding']

    v0 = np.array(Embedding0)
    v1 = np.array(Embedding1)

    c = np.dot(v0,v1)

    print("##################")
    print(f"Msg0: {Msg0}")
    print(f"Msg1: {Msg1}")
    print(f"Dot Product: {c}")
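    # Clamp to [-1, 1] so floating-point rounding cannot make acos raise a domain error.
    print(f"Angle: {math.degrees(math.acos(max(-1.0, min(1.0, c))))}")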

One observation is that as the synthesized message lengths increase, the cone tends to be smaller on average. So if you force K0 and K1 in the script above to be the same high number, your average angle gets smaller (in the limit). There is probably some sweet spot, though I haven’t measured it.

anon22939549 · August 17, 2023, 4:28am

I saw that as well. I did random draws of 1000 samples each of 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, and 2000 tokens and looked at the angles between all of them, within and between groups. The 2,000 token samples had, by far, the smallest average angular difference.
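For anyone wanting to repeat that comparison, here is a sketch of the within-group and between-group averages. The helper names are made up, and the two groups are synthetic stand-ins (a shared direction plus noise), not real embeddings; the noise scales are arbitrary and only mimic one group sitting in a tighter cone than the other.

```python
import numpy as np

def mean_angle_deg(A, B=None):
    """Mean angle in degrees between unit-length rows.

    With B=None, averages over all distinct pairs within A;
    otherwise over every (row of A, row of B) pair.
    """
    if B is None:
        dots = (A @ A.T)[np.triu_indices(len(A), k=1)]  # upper triangle, no self-pairs
    else:
        dots = (A @ B.T).ravel()
    return float(np.degrees(np.arccos(np.clip(dots, -1.0, 1.0))).mean())

def synthetic_group(rng, shared, noise_scale, n=200):
    # Hypothetical stand-in for a batch of embeddings: shared direction plus noise.
    d = shared.shape[0]
    noise = rng.standard_normal((n, d)) / np.sqrt(d)  # noise vectors of length about 1
    X = shared + noise_scale * noise
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
shared = np.zeros(256)
shared[0] = 1.0

short_msgs = synthetic_group(rng, shared, noise_scale=1.0)  # wider cone
long_msgs = synthetic_group(rng, shared, noise_scale=0.3)   # tighter cone

within_short = mean_angle_deg(short_msgs)
within_long = mean_angle_deg(long_msgs)
between = mean_angle_deg(short_msgs, long_msgs)
print(f"within short: {within_short:.1f}  within long: {within_long:.1f}  between: {between:.1f}")
```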

My thought on that is that, when sampling uniformly at random, the interactions drive the model to some sort of plain vanilla region of the vector space. But there could be some combination or subset of tokens, or tokens in some particular order, that land in other areas of the space.

My new, just-thought-of-as-I-type-this hypothesis is that if one were to do some type of clustering on single-token embeddings, identify the most orthogonal clusters, take large within-cluster samples from each, and compute the between-group cosine similarities, the bias for embeddings of longer token inputs to be more similar would disappear.

But, I’ll need to wait until I can get on my department server so I can take the cross-product of a 1536x100256 matrix and try to cluster that many points.
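A rough size estimate shows why a server is needed. This sketch assumes "cross-product" here means the full matrix of pairwise dot products between all 100256 single-token embeddings, and that values are stored as 4-byte or 8-byte floats; both are assumptions, not details from the post.

```python
n_tokens = 100_256  # single-token embeddings, per the post
dim = 1536          # ada-002 embedding dimension

def gib(n_bytes):
    return n_bytes / 2**30

embeddings_f32 = n_tokens * dim * 4  # the embeddings themselves
pairwise_f32 = n_tokens ** 2 * 4     # all pairwise dot products, float32
pairwise_f64 = n_tokens ** 2 * 8     # same, float64

print(f"embeddings (float32):      {gib(embeddings_f32):6.2f} GiB")
print(f"pairwise matrix (float32): {gib(pairwise_f32):6.2f} GiB")
print(f"pairwise matrix (float64): {gib(pairwise_f64):6.2f} GiB")
```

The embeddings fit in well under 1 GiB, but the full pairwise matrix is about 37 GiB in float32 and twice that in float64. Computing it in row blocks and keeping only the running minimum dot product would avoid holding it all in memory at once.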

curt.kennedy · August 17, 2023, 4:37am

Your 63 degree angle is probably the current Guinness World Record of ada-002 max angles.

One thing I thought of is that if the tokens are “within the same language”, you might get higher angles.

I say this because if you start mixing languages, the meaning is nothing, and so it just ends up in some averaged dead zone of meaning in the vector space, hence the low angle, maybe???

But will certainly be interested if anyone can, at this point, get an angle above 90 degrees.

Anything close to 180 would be absolutely jaw dropping.
