Some questions about text-embedding-ada-002’s embedding

A quick test: if I subtract out the mean of 70 samples, I get much more sensible results. The first 5 are ChatGPT suggestions for dissimilar sentences, and the next 5 for similar:

-0.017663949682680865
-0.07352484277035345
-0.05597005318789076
-0.009429209531217298
-0.06919492370655664

0.6165518204173611
0.5964354661570286
0.7516415313500149
0.8033141561180126
0.6907252749720518

Sentences were:

"The cat sat on the mat" , "The number 42 is the answer to the ultimate question of life, the universe, and everything",
"The grass is green" , "The stock market has been very volatile recently",
"I like to play chess" , "The weather is hot today",
"She is a nurse" , "The company's profits have been increasing",
"The sun rises in the east" , "The movie was not as good as the book",

"The cat sat on the mat" , "The feline was perched on the rug",
"I like to play chess" , "I enjoy playing the strategy game",
"She is a nurse" , "She works in healthcare",
"The sun rises in the east" , "The morning star rises in the east",
"The apple fell from the tree" , "The fruit dropped from the branches"
3 Likes

Are those the actual values of the cosine similarities / dot products that YOU computed from the sentences (not ChatGPT), using the ada-002 embedding engine?

1 Like

Ah yeah, I have a bit of an unusual setup, as it is in a Windows app.

I get the embeddings for all the sentences directly from the ada-002 API (the colored line chart).

I averaged them all by parameter, leaving a 1536-element array (the blue line chart).

Then I go through each pair and subtract that average value from each corresponding value, leaving a modified array for the left and right sentences.

Lastly, I calculate the cosine similarity between these new arrays. I'm using a framework called Accord to calculate that, but I assume it gives the same result as any other implementation.
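
Roughly, the cosine step is just this (a minimal plain-Ruby sketch of the math, not the actual Accord call):

def cosine_similarity(u, v)
  dot = u.zip(v).sum { |a, b| a * b }
  dot / (Math.sqrt(u.sum { |a| a * a }) * Math.sqrt(v.sum { |a| a * a }))
end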

I’ve tried French, numbers, HTML, but the shape seems pretty persistent :slight_smile:

Also, full disclosure, it’s not impossible I’m doing something wrong - day one with this API.

When you subtract the mean you are changing the angle of the vector.

Example in 2D: x = [0, 1], the angle is 90 degrees.

Subtract the mean of 0.5: y = [-0.5, 0.5], the angle is now 135 degrees.
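
A quick way to verify that numerically (a throwaway Ruby sketch):

# angle of a 2D vector in degrees
angle = ->(v) { Math.atan2(v[1], v[0]) * 180 / Math::PI }

angle.call([0.0, 1.0])   # => 90.0  (original vector)
angle.call([-0.5, 0.5])  # => 135.0 (after subtracting the mean of 0.5)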

It’s interesting that the coordinates are correlated quite a bit, but to get a true answer, I would just take the dot product of the unmodified embedding vectors and report out what you see there.

5 Likes

Yeah, the dot product and cosine similarity are the same without the adjustment, which makes sense if they are normalized. I was just thinking the spikes are so large and the correlation so strong that it kind of drowns out the signal. But like you say, that is the true answer, and it is still there in the decimal places.

Tbh I’m not versed enough in embedding techniques to justify opinions on any of this :). Thanks for your help.

Lots of papers out there on embeddings. For example, here is an algorithm to find analogous words using embeddings (the 3CosAdd method):

When given a pair of words $a$ and $a^*$ and a third word $b$, the analogy relationship between $a$ and $a^*$ can be used to find the word corresponding to $b$. Mathematically, it is expressed as

$$a : a^* \;::\; b : \underline{\;\;\;} \tag{7}$$

where the blank is $b^*$. One example could be

$$\text{man} : \text{king} \;::\; \text{woman} : \underline{\;\;\;} \tag{8}$$

The 3CosAdd method [Mikolov, Yih and Zweig] solves for $b^*$ using the following equation:

$$b^* = \underset{w \in V}{\arg\max} \; \cos(w,\; a^* - a + b) \tag{9}$$

Reference: Mikolov, Yih and Zweig (2013).
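
As a rough illustration (my own sketch, using a made-up vocab hash mapping words to embedding arrays, not code from the paper), 3CosAdd looks like:

def cosine(u, v)
  dot = u.zip(v).sum { |a, b| a * b }
  dot / (Math.sqrt(u.sum { |a| a * a }) * Math.sqrt(v.sum { |a| a * a }))
end

def three_cos_add(vocab, a, a_star, b)
  # target vector: a* - a + b
  target = vocab[a_star].zip(vocab[a], vocab[b]).map { |s, x, y| s - x + y }
  # pick the closest word in the vocabulary, excluding the query words
  vocab.keys.reject { |w| [a, a_star, b].include?(w) }
       .max_by { |w| cosine(vocab[w], target) }
end

# three_cos_add(vocab, "man", "king", "woman")  # ideally returns "queen"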

2 Likes

Yeah, I’m confused by this. Using the current model, “text-similarity-ada-001”, the similarity numbers are quite different from the ones above.

When I run this, I get a very different result.

Check similarities:

irb(main):004:0> params={string1:a,string2:b, method:'cosine'}
=> 
{:string1=>"The cat sat on the mat",                        
...                                                         
irb(main):005:0> Embeddings.test_strings(params)
=> 
{:string1=>"The cat sat on the mat",
 :string2=>"The number 42 is the answer to the ultimate question of life, the universe, and everything",
 :method=>"cosine",
 :output=>0.6925531623476415}
irb(main):006:0> params={string1:a,string2:b, method:'dot'}
=> 
{:string1=>"The cat sat on the mat",
...
irb(main):007:0> Embeddings.test_strings(params)
=> 
{:string1=>"The cat sat on the mat",
 :string2=>"The number 42 is the answer to the ultimate question of life, the universe, and everything",
 :method=>"dot",
 :output=>0.6925531596482191}

Check distances:

irb(main):008:0> params={string1:a,string2:b, method:'manhattan'}
=> 
{:string1=>"The cat sat on the mat",
...                             
irb(main):009:0> Embeddings.test_strings(params)
=> 
{:string1=>"The cat sat on the mat",
 :string2=>"The number 42 is the answer to the ultimate question of life, the universe, and everything",
 :method=>"manhattan",          
 :output=>19.688237328809972}   
irb(main):010:0> params={string1:a,string2:b, method:'euclidean'}
=> 
{:string1=>"The cat sat on the mat",
...                             
irb(main):011:0> Embeddings.test_strings(params)
=> 
{:string1=>"The cat sat on the mat",
 :string2=>"The number 42 is the answer to the ultimate question of life, the universe, and everything",
 :method=>"euclidean",
 :output=>0.7841515624597036}

Since the dot product and the cosine similarity methods give the same result (within rounding error), these numbers match and confirm each other by different methods. Also, the Euclidean distance is as expected relative to the dot product (and of course the cosine similarity, for unit vectors).
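
For unit vectors the Euclidean distance and the dot product are related by ||u - v|| = sqrt(2 - 2*(u.v)), so the numbers above can be cross-checked directly:

Math.sqrt(2 - 2 * 0.6925531596482191)
# => 0.78415156..., matching the euclidean output above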

Those were the numbers I got when I tried removing the background bias (which was the average of a lot of different samples).

1 Like

So, if I understand you correctly, you averaged the elements of some large number of vectors, like in this simple Ruby example:

vectors = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]

# initialize the accumulator with the first vector
accumulator = vectors[0]

# loop through the rest of the vectors
(1...vectors.size).each do |i|
    # add the current vector to the accumulator
    accumulator = accumulator.zip(vectors[i]).map{|x, y| x + y}
end

# calculate the average by dividing the accumulator by the number of vectors
average = accumulator.map{|x| x.to_f/vectors.size}

Testing example:

average = accumulator.map{|x| x.to_f/vectors.size}
=> [6.0, 7.0, 8.0, 9.0, 10.0]

Then you subtracted the results from each vector before you calculated the dot product / cosine similarity.

Is that correct?

Thanks.

Yes, exactly that :). I was looking at the graphs of the data and there were large spikes, and even the general shape of the data was the same. It seemed to be very consistent regardless of what I tried, so I tried subtracting that out to see if it made more sense.
In a way it does, but then again maybe the 196th index is very important and I’m not at the point of understanding why.

1 Like

OK. Thanks @debreuil

I will try this method, which after some research I found is called “centering” the vectors :). I guess there are other names for it. I will add it to my test harness when I get a chance.

Thanks!

Example “centering” the vectors …

vectors = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
# element-wise average of the vectors (reduce(:+) on arrays would concatenate them rather than sum them)
average = vectors.transpose.map{|col| col.sum.to_f/vectors.size}
debiased_vectors = vectors.map{|v| v.zip(average).map{|x, y| x - y}}
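
Testing example (expected output for the toy vectors above):

debiased_vectors
=> [[-5.0, -5.0, -5.0, -5.0, -5.0], [0.0, 0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0, 5.0]]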

My final question is:

Where did you get the vectors you used to calculate the average, and how many vectors did you average?

Thanks.

Didn’t know that - interesting. Thank you too :slight_smile:

Oh, and the vectors were from a list I got ChatGPT to generate. I’m trying to get vectors for primitive gradients, like inanimate->animate, cold->hot, still->fast.

So I had it generate a bunch of these opposite sentences, and was surprised they didn’t seem to have much of a signal. I then asked it to generate sentence pairs that would have a high cosine similarity, and then low ones. I mostly used those.

I’ve since added math, HTML, languages, one space, very long, etc., but the bias seems similar. I’m sure it can be better than I’ve done with a more random sampling, though.

1 Like

Chatty told me, so maybe it’s an AI hallucination. haha

My (maybe) final question is:

Where did you get the vectors you used to calculate the average, and how many vectors did you average?

Thanks.

Lol, chatty’s word over mine in this domain, any day :slight_smile:

(oh and saw your edit, answered above)

Chatty is a very good hallucinator, Timothy Leary would be proud :slight_smile:

2 Likes

Subtracting off the mean and then correlating is essentially taking the covariance. And visually, it appears correlated in many spots, not even just the higher “spikey” spots.

So you are definitely starting to answer why the embedding space is so focused along a few dimensions, instead of varying around the entire hyper-dimensional unit sphere of 1500+ dimensions.

As for any technical answer coming from ChatGPT, I would ignore it.

There are people who hook GPT-3 (text-davinci-003 or 002) up with Wolfram Alpha and get it to answer more math-related questions. You probably need some classifiers ahead of this to put it in “math mode” or “not math mode”. The classifier could even be a fine-tuned model of GPT-3, such as babbage or curie.

2 Likes

OK, here is the solution. It can basically be solved by post-processing. Apparently this is a problem for trained embeddings out of the gate. The technical term is that ada-002 isn’t isotropic. One big vector is taking over the space and essentially reducing ada-002’s potential (dimensionality). Post-processing can improve this. Now, the paper shows the improvements are slight (2.3%), but it can be done.

2 Likes

Very interesting paper - subtracting the mean makes sense, and it’s interesting that they talk about dimensionality reduction as well. I will give that a try; it would be nice to get a good publicly available version of that.
All these links are very helpful, thanks for taking the time to go over these things.

2 Likes

Once I get some time, I was going to run PCA on the data, just like “Algorithm 1” in the paper and see what I get. I will report back here.
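
For reference, here is a rough plain-Ruby sketch of that kind of post-processing (my own approximation of the idea in the paper’s “Algorithm 1”, not code from the paper: center the embeddings, then project out the top principal components; the postprocess name and top_d parameter are just illustrative):

require 'matrix'

# embeddings: an Array of equal-length Arrays of Floats (e.g. 1536 values each)
def postprocess(embeddings, top_d = 2)
  n = embeddings.size

  # 1. centering: subtract the per-dimension mean
  mean = embeddings.transpose.map { |col| col.sum / n }
  centered = embeddings.map { |v| v.zip(mean).map { |x, m| x - m } }

  # 2. PCA via eigendecomposition of the covariance matrix
  x = Matrix.rows(centered)
  eig = ((x.transpose * x) / n.to_f).eigen
  top = eig.eigenvalues.each_with_index
           .sort_by { |val, _| -val }
           .first(top_d)
           .map { |_, i| eig.eigenvectors[i] }

  # 3. remove the projection onto each of the top components
  centered.map do |v|
    vec = Vector.elements(v)
    top.each do |u|
      u = u / u.norm
      vec -= u * u.inner_product(vec)
    end
    vec.to_a
  end
end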

2 Likes

Reading through that paper, it made me think these embeddings might be encoding how common a word is, as well as semantic meaning. I’m only interested in the conceptual meaning for what I’m doing, so I wanted to verify that, and try to subtract that out as well if it’s true.

From my initial tests, it seems that is true. I have 100 sentences made by giving ChatGPT a list of the 50 most common words to use, and another set where it was told to avoid them (as well as pronouns, etc.). Examples are:

He is a good man.
The flowers in the garden are beautiful.
vs
Ravaged city bears scars of war.
Velociraptors roamed prehistoric savannah.

I made them all isomorphic, and made an image from the sums of their embeddings (more red is more positive, more blue is more negative). At least with this test it is clear the common-word sentences (first image) have generally lower values, and the uncommon-word ones higher values. These images are just normalized 48x32 images made from the 1536 embedding values directly.

[image: 48x32 renderings of the summed embedding values for the common-word and uncommon-word sentence sets]
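
For anyone curious, here is roughly how such an image can be generated (a rough sketch of the idea, not the actual code used; it writes a simple PPM file with red for positive values and blue for negative):

def embedding_to_ppm(embedding, path, width: 48, height: 32)
  max = embedding.map(&:abs).max
  File.open(path, "w") do |f|
    f.puts "P3", "#{width} #{height}", "255"
    embedding.first(width * height).each do |v|
      level = (255 * v.abs / max).round
      f.puts(v >= 0 ? "#{level} 0 0" : "0 0 #{level}")
    end
  end
end

# embedding_to_ppm(common_sum, "common.ppm")   # names here are illustrative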

This is a first pass, but I think there is a signal there. It makes sense that word frequency is embedded, but the fact that common words tend to be low seems a bit surprising.

1 Like