Some questions about text-embedding-ada-002’s embeddings

Yeah, I’m confused by this. Using the current model, “text-similarity-ada-001”, the similarity numbers are quite different from the ones above.

When I run this, I get a very different result.

Check similarities:

irb(main):004:0> params={string1:a,string2:b, method:'cosine'}
=> 
{:string1=>"The cat sat on the mat",                        
...                                                         
irb(main):005:0> Embeddings.test_strings(params)
=> 
{:string1=>"The cat sat on the mat",
 :string2=>"The number 42 is the answer to the ultimate question of life, the universe, and everything",
 :method=>"cosine",
 :output=>0.6925531623476415}
irb(main):006:0> params={string1:a,string2:b, method:'dot'}
=> 
{:string1=>"The cat sat on the mat",
...
irb(main):007:0> Embeddings.test_strings(params)
=> 
{:string1=>"The cat sat on the mat",
 :string2=>"The number 42 is the answer to the ultimate question of life, the universe, and everything",
 :method=>"dot",
 :output=>0.6925531596482191}

Check distances:

irb(main):008:0> params={string1:a,string2:b, method:'manhattan'}
=> 
{:string1=>"The cat sat on the mat",
...                             
irb(main):009:0> Embeddings.test_strings(params)
=> 
{:string1=>"The cat sat on the mat",
 :string2=>"The number 42 is the answer to the ultimate question of life, the universe, and everything",
 :method=>"manhattan",          
 :output=>19.688237328809972}   
irb(main):010:0> params={string1:a,string2:b, method:'euclidean'}
=> 
{:string1=>"The cat sat on the mat",
...                             
irb(main):011:0> Embeddings.test_strings(params)
=> 
{:string1=>"The cat sat on the mat",
 :string2=>"The number 42 is the answer to the ultimate question of life, the universe, and everything",
 :method=>"euclidean",
 :output=>0.7841515624597036}

Since the dot product and cosine similarity methods give the same value (to within rounding error), these numbers match and confirm each other by different methods. Also, the Euclidean distance is as expected relative to the dot product (and of course the cosine similarity, for unit vectors).
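
For unit vectors the Euclidean distance and cosine similarity are tied together by d = sqrt(2 − 2·cos θ); plugging in the number above, sqrt(2 − 2 × 0.69255) ≈ 0.7842, which matches the euclidean output.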

Those were the numbers I got when I tried removing the background bias (which was the average of a lot of different samples).

1 Like

So, if I understand you correctly, you averaged the elements of some large number of vectors, like this simple example in Ruby:

vectors = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]

# initialize the accumulator with the first vector
accumulator = vectors[0]

# loop through the rest of the vectors
(1...vectors.size).each do |i|
    # add the current vector to the accumulator
    accumulator = accumulator.zip(vectors[i]).map{|x, y| x + y}
end

# calculate the average by dividing the accumulator by the number of vectors
average = accumulator.map{|x| x.to_f/vectors.size}

Testing example:

average = accumulator.map{|x| x.to_f/vectors.size}
=> [6.0, 7.0, 8.0, 9.0, 10.0]

Then you subtracted the result from each vector before you calculated the dot product / cosine similarity.

Is that correct?

Thanks.

Yes, exactly that :). I was looking at the graphs of the data and there were large spikes, and even the general shape of the data was the same. It seemed very consistent regardless of what I tried, so I wondered whether subtracting that out would make things make more sense.
In a way it does, but then again maybe the 196th index is very important and I’m not at the point of understanding why.

1 Like

OK. Thanks @debreuil

I will try this method, which after some research I found is called “centering” the vectors :). I guess there are other names for it. I will add it to my test harness when I get a chance.

Thanks!

Example “centering” the vectors …

vectors = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
average = vectors.transpose.map{|col| col.sum.to_f/vectors.size} # element-wise mean of the vectors
debiased_vectors = vectors.map{|v| v.zip(average).map{|x, y| x - y}} # subtract the mean ("centering")

My final question is:

Where did you get the vectors you used to calculate the average, and how many vectors did you average?

Thanks.

Didn’t know that - interesting. Thank you too :slight_smile:

Oh, and the vectors were from a list I got ChatGPT to generate. I’m trying to get vectors for primitive gradients, like inanimate->animate, cold->hot, still->fast.

So I had it generate a bunch of these opposite sentences, and was surprised they didn’t seem to have much of a signal. I then asked it to generate sentence pairs that would have a high cosine similarity, and then low ones. I mostly used those.

I’ve since added math, HTML, languages, one space, very long, etc., but the bias seems similar. I’m sure it could be done better with a more random sampling, though.

1 Like

Chatty told me, so maybe it’s an AI hallucination. haha

My (maybe) final question is:

Where did you get the vectors you used to calculate the average, and how many vectors did you average?

Thanks.

Lol, chatty’s word over mine in this domain, any day :slight_smile:

(oh and saw your edit, answered above)

Chatty is a very good hallucinator, Timothy Leary would be proud :slight_smile:

2 Likes

Subtracting off the mean and then correlating is essentially taking the covariance. And visually, it appears correlated in many spots, not just the higher “spikey” spots.
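
If it helps to see that connection concretely, here is a small numpy sketch (random data standing in for real embeddings): the dot products between mean-subtracted vectors are exactly the entries of the centered Gram matrix, and its nonzero eigenvalues match those of the covariance matrix that PCA later diagonalizes, up to a 1/(n − 1) factor.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))   # stand-in corpus: 500 "embeddings" of dimension 256
mu = X.mean(axis=0)               # per-dimension mean (the "background bias")
Xc = X - mu                       # centered embeddings

gram = Xc @ Xc.T                  # entries are dot products of centered vectors
cov = Xc.T @ Xc / (len(Xc) - 1)   # covariance matrix that PCA diagonalizes

# top eigenvalues agree once the 1/(n - 1) scaling is accounted for
top_gram = np.sort(np.linalg.eigvalsh(gram))[-5:] / (len(Xc) - 1)
top_cov = np.sort(np.linalg.eigvalsh(cov))[-5:]
print(np.allclose(top_gram, top_cov))   # True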

So you are definitely starting to answer why the embedding space is so focused along a few dimensions, instead of varying around the entire hyper-dimensional unit sphere of 1500+ dimensions.

As for any technical answer coming from ChatGPT, I would ignore it.

There are people who hook GPT-3 (text-davinci-003 or 002) up with Wolfram Alpha and get it to answer more math-related questions. You probably need some classifiers ahead of this to put it in “math mode” or “not math mode”. The classifier could even be a fine-tuned GPT-3 model, such as babbage or curie.

2 Likes

OK, here is the solution: it can basically be solved by post-processing. Apparently this is a problem for trained embeddings out of the gate. The technical term is that ada-002’s embeddings aren’t isotropic. One big vector is taking over the space and essentially reducing ada-002’s potential (dimensionality). Post-processing can improve this. Now, the paper shows the improvements are slight (2.3%), but it can be done.
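
For reference, the post-processing step from the paper (“Algorithm 1”) boils down to: compute the mean μ of your embedding set and the top D principal components u_1 … u_D of the mean-subtracted data, then replace each embedding v with v′ = (v − μ) − Σ_{i=1..D} (u_i · v) u_i, i.e. subtract the mean and then remove the projection onto the dominant directions.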

2 Likes

Very interesting paper. Subtracting the mean makes sense, and it’s interesting that they talk about dimensionality reduction as well. I will give that a try; it would be nice to get a good publicly available version of that.
All these links are very helpful; thanks for taking the time to go over these things.

2 Likes

Once I get some time, I am going to run PCA on the data, just like “Algorithm 1” in the paper, and see what I get. I will report back here.

2 Likes

Reading through that paper made me think these embeddings might be encoding how common a word is, as well as semantic meaning. I’m only interested in the conceptual meaning for what I’m doing, so I wanted to verify that, and try to subtract it out as well if it’s true.

From my initial tests, it seems that is true. I have 100 sentences, made by giving ChatGPT a list of the 50 most common words, and another set made to avoid them (as well as pronouns, etc.). Examples are:

He is a good man.
The flowers in the garden are beautiful.
vs
Ravaged city bears scars of war.
Velociraptors roamed prehistoric savannah.

I made them all isomorphic, and made an image from the sums of their embeddings (more red is more positive, more blue is more negative). At least with this test it is clear the common words (first image) have generally lower values, and uncommon words higher values. These images are just normalized 48x32 images made from the 1536 embedding values directly.

[image: embedding-value grids for the common-word and uncommon-word sentence sets]

This is a first pass, but I think there is a signal there. It makes sense that word frequency is embedded, but the fact that common words tend to have low values seems a bit surprising.
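
In case anyone wants to try this, here is a rough sketch of how such an image can be built from a single embedding (the 32×48 reshape layout and colormap are my guesses, and the random vector is just a stand-in for a real ada-002 embedding):

import numpy as np
import matplotlib.pyplot as plt

emb = np.random.default_rng(0).normal(scale=0.02, size=1536)  # stand-in for a real ada-002 embedding

img = emb.reshape(32, 48)                          # 1536 values laid out as a 48x32 grid
lim = np.abs(img).max()                            # symmetric color scale so zero maps to white
plt.imshow(img, cmap='bwr', vmin=-lim, vmax=lim)   # red = more positive, blue = more negative
plt.axis('off')
plt.show()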

1 Like

OK, I coded up the algorithm and I’d say I got good results in preliminary testing. I now get cosine similarities that are positive, negative and zero across the embedding search space. The results seem to make sense too!

The only weird thing is that my max and min cosine similarities are about ±0.1 instead of ±1. I am only using the top 15 dimensions (D/100 for ada-002). Maybe this is reducing the energy somehow? Anyway, the relative correlations and anti-correlations seem to make sense, plus my top correlations seem better than the original ones.

Will have to test more to see, but so far the algorithm in the paper seems to work!

1 Like

Hey @ruby_coder @debreuil
Here is the code I wrote to do this. Hope it helps.

import numpy as np
import sklearn.decomposition
import pickle
import time


# Apply 'Algorithm 1' to the ada-002 embeddings to make them isotropic, taken from the paper:
# ALL-BUT-THE-TOP: SIMPLE AND EFFECTIVE POST-PROCESSING FOR WORD REPRESENTATIONS
# Jiaqi Mu, Pramod Viswanath

# This uses Principal Component Analysis (PCA) to 'evenly distribute' the embedding vectors (make them isotropic)
# For more information on PCA, see https://jamesmccaffrey.wordpress.com/2021/07/16/computing-pca-using-numpy-without-scikit/


# get the file pointer of the pickle containing the embeddings
fp = open('/path/to/your/data/Embedding-Latest.pkl', 'rb')


# the embedding data here is a dict consisting of key / value pairs
# the key is the hash of the message (SHA3-256), the value is the embedding from ada-002 (array of dimension 1536)
# the hash can be used to look up the original text in a database
E = pickle.load(fp) # load the data into memory

# separate the keys (hashes) and values (embeddings) into separate vectors
K = list(E.keys()) # vector of all the hash values 
X = np.array(list(E.values())) # vector of all the embeddings, converted to numpy arrays


# list the total number of embeddings
# this can be truncated if there are too many embeddings to do PCA on
print(f"Total number of embeddings: {len(X)}")

# get dimension of embeddings, used later
Dim = len(X[0])

# print out the first few embeddings
print("First two embeddings are: ")
print(X[0]) 
print(f"First embedding length: {len(X[0])}")
print(X[1])
print(f"Second embedding length: {len(X[1])}")


# compute the mean of all the embeddings, and print the result
mu = np.mean(X, axis=0) # same as mu in paper
print(f"Mean embedding vector: {mu}")
print(f"Mean embedding vector length: {len(mu)}")


# subtract the mean vector from each embedding vector ... vectorized in numpy
X_tilde = X - mu # same as v_tilde(w) in paper



# do the heavy lifting of extracting the principal components
# note that this is a function of the embeddings you currently have here, and this set may grow over time
# therefore the PCA basis vectors may change over time, and your final isotropic embeddings may drift over time
# but the drift should stabilize after you have extracted enough embedding data to characterize the nature of the embedding engine
print(f"Performing PCA on the normalized embeddings ...")
pca = sklearn.decomposition.PCA()  # new object
TICK = time.time() # start timer
pca.fit(X_tilde) # do the heavy lifting!
TOCK = time.time() # end timer
DELTA = TOCK - TICK

print(f"PCA finished in {DELTA} seconds ...")

# dimensional reduction stage (the only hyperparameter)
# pick max dimension of PCA components to express embeddings
# in general this is some integer less than or equal to the dimension of your embeddings
# it could be set as a high percentile, say 95th percentile of pca.explained_variance_ratio_
# but just hardcoding a constant here
D = 15 # hyperparameter on dimension (out of 1536 for ada-002), paper recommends D = Dim/100


# form the set of v_prime(w), which is the final embedding
# this could be vectorized in numpy to speed it up, but coding it directly here in a double for-loop to avoid errors and to be transparent
E_prime = dict() # output dict of the new embeddings
N = len(X_tilde)
N10 = round(N/10)
U = pca.components_ # set of PCA basis vectors, sorted by most significant to least significant
print(f"Shape of full set of PCA componenents {U.shape}")
U = U[0:D,:] # take the top D dimensions (or take them all if D is the size of the embedding vector)
print(f"Shape of downselected PCA componenents {U.shape}")
for ii in range(N):
    v_tilde = X_tilde[ii]
    v = X[ii]
    v_projection = np.zeros(Dim) # start to build the projection
    # project the original embedding onto the PCA basis vectors, use only first D dimensions
    for jj in range(D):
        u_jj = U[jj,:] # vector
        v_jj = np.dot(u_jj,v) # scalar
        v_projection += v_jj*u_jj # vector
    v_prime = v_tilde - v_projection # final embedding vector
    v_prime = v_prime/np.linalg.norm(v_prime) # create unit vector
    E_prime[K[ii]] = v_prime 

    if (ii%N10 == 0) or (ii == N-1):
        print(f"Finished with {ii+1} embeddings out of {N} ({round(100*ii/N)}% done)")


# save as new pickle
print("Saving new pickle ...")
embeddingName = '/path/to/your/data/Embedding-Latest-Isotropic.pkl'
with open(embeddingName, 'wb') as f:  # Python 3: open(..., 'wb')
    pickle.dump([E_prime,mu,U], f)
    print(embeddingName)

print("Done!")

# When working with live data with a new embedding from ada-002, be sure to transform it first with this function before comparing it
#
# def projectEmbedding(v,mu,U):
#     v = np.array(v)
#     v_tilde = v - mu
#     v_projection = np.zeros(len(v)) # start to build the projection
#     # project the original embedding onto the PCA basis vectors, use only first D dimensions
#     for u in U:
#         v_jj = np.dot(u,v) # scalar
#         v_projection += v_jj*u # vector
#     v_prime = v_tilde - v_projection # final embedding vector
#     v_prime = v_prime/np.linalg.norm(v_prime) # create unit vector
#     return v_prime 
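#
# Hypothetical usage sketch (the query/embedding names below are placeholders, not part of this script):
#
# with open('/path/to/your/data/Embedding-Latest-Isotropic.pkl', 'rb') as f:
#     E_prime, mu, U = pickle.load(f)
#
# q = get_ada002_embedding("some new query")      # however you call the embeddings endpoint
# q_prime = projectEmbedding(q, mu, U)            # same transform as above
#
# # both vectors are unit length, so the dot product is the cosine similarity
# score = np.dot(q_prime, E_prime[some_message_hash])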
1 Like

OK, I just realized why I wasn’t getting ±1: I forgot to normalize back out to unit vectors. I updated the code where it creates the new embeddings, and the transformation in the comments at the bottom.

Results look even better! The orthogonal values look as expected (in left field) and the correlated and non-correlated values make sense too.

2 Likes

This looks awesome, will try it out — thank you!

1 Like

Thanks @curt.kennedy and for the @ mention as well.

Will definitely look at your code and probably port it to Ruby and add this method to my OpenAI test harness.

Thanks again for sharing!

:slight_smile:

1 Like

It looks like I have something new to digest. Thanks for all the experiments. I figured I would wait until you came to some conclusions :wink:

1 Like