OK, I coded up the algorithm and I’d say I got good results in preliminary testing. I now get cosine similarities that are positive, negative and zero across the embedding search space. The results seem to make sense too!

Only weird thing is that my max and min cosine similarities are ±0.1 instead of ±1. I am only using the top 15 PCA dimensions (Dim/100 for ada-002). Maybe this is reducing the energy somehow? Anyway, the relative correlations and anti-correlations seem to make sense. Plus my top correlations seem better than the original ones.

Will have to test more to see, but so far the algorithm in the paper seems to work!

import numpy as np
import sklearn.decomposition
import pickle
import time
# Apply 'Algorithm 1' to the ada-002 embeddings to make them isotropic, taken from the paper:
# ALL-BUT-THE-TOP: SIMPLE AND EFFECTIVE POST-PROCESSING FOR WORD REPRESENTATIONS
# Jiaqi Mu, Pramod Viswanath
# This uses Principal Component Analysis (PCA) to 'evenly distribute' the embedding vectors (make them isotropic)
# For more information on PCA, see https://jamesmccaffrey.wordpress.com/2021/07/16/computing-pca-using-numpy-without-scikit/
# open the pickle containing the embeddings
# the embedding data here is a dict consisting of key / value pairs
# the key is the hash of the message (SHA3-256), the value is the embedding from ada-002 (array of dimension 1536)
# the hash can be used to look up the original text in a database
with open('/path/to/your/data/Embedding-Latest.pkl', 'rb') as fp:
    E = pickle.load(fp) # load the data into memory
# separate the keys (hashes) and values (embeddings) into separate vectors
K = list(E.keys()) # vector of all the hash values
X = np.array(list(E.values())) # vector of all the embeddings, converted to numpy arrays
# list the total number of embeddings
# this can be truncated if there are too many embeddings to do PCA on
print(f"Total number of embeddings: {len(X)}")
# get dimension of embeddings, used later
Dim = len(X[0])
# flash out the first few embeddings
print("First two embeddings are: ")
print(X[0])
print(f"First embedding length: {len(X[0])}")
print(X[1])
print(f"Second embedding length: {len(X[1])}")
# compute the mean of all the embeddings, and flash the result
mu = np.mean(X, axis=0) # same as mu in paper
print(f"Mean embedding vector: {mu}")
print(f"Mean embedding vector length: {len(mu)}")
# subtract the mean vector from each embedding vector ... vectorized in numpy
X_tilde = X - mu # same as v_tilde(w) in paper
# do the heavy lifting of extracting the principal components
# note that this is a function of the embeddings you currently have here, and this set may grow over time
# therefore the PCA basis vectors may change over time, and your final isotropic embeddings may drift over time
# but the drift should stabilize after you have extracted enough embedding data to characterize the nature of the embedding engine
print("Performing PCA on the mean-centered embeddings ...")
pca = sklearn.decomposition.PCA() # new object
TICK = time.time() # start timer
pca.fit(X_tilde) # do the heavy lifting!
TOCK = time.time() # end timer
DELTA = TOCK - TICK
print(f"PCA finished in {DELTA} seconds ...")
# component selection stage (the only hyperparameter)
# pick how many of the top PCA components get projected out of the embeddings
# in general this is some integer less than or equal to the dimension of your embeddings
# it could be set from a high percentile, say the 95th percentile, of pca.explained_variance_ratio_
# but just hardcoding a constant here
D = 15 # hyperparameter on dimension (out of 1536 for ada-002), paper recommends D = Dim/100
# form the set of v_prime(w), which is the final embedding
# this could be vectorized in numpy to speed it up, but coding it directly here in a double for-loop to avoid errors and to be transparent
E_prime = dict() # output dict of the new embeddings
N = len(X_tilde)
N10 = max(1, round(N/10)) # progress interval; at least 1 so the modulo below never divides by zero
U = pca.components_ # set of PCA basis vectors, sorted by most significant to least significant
print(f"Shape of full set of PCA components {U.shape}")
U = U[0:D,:] # take the top D dimensions (or take them all if D is the size of the embedding vector)
print(f"Shape of downselected PCA components {U.shape}")
for ii in range(N):
    v_tilde = X_tilde[ii]
    v = X[ii]
    v_projection = np.zeros(Dim) # start to build the projection
    # project the original embedding onto the PCA basis vectors, use only first D dimensions
    for jj in range(D):
        u_jj = U[jj,:] # vector
        v_jj = np.dot(u_jj,v) # scalar
        v_projection += v_jj*u_jj # vector
    v_prime = v_tilde - v_projection # final embedding vector
    v_prime = v_prime/np.linalg.norm(v_prime) # create unit vector
    E_prime[K[ii]] = v_prime
    if (ii%N10 == 0) or (ii == N-1):
        print(f"Finished with {ii+1} embeddings out of {N} ({round(100*(ii+1)/N)}% done)")
# save as new pickle
print("Saving new pickle ...")
embeddingName = '/path/to/your/data/Embedding-Latest-Isotropic.pkl'
with open(embeddingName, 'wb') as f: # Python 3: open(..., 'wb')
    pickle.dump([E_prime,mu,U], f)
print(embeddingName)
print("Done!")
# When working with live data with a new embedding from ada-002, be sure to transform it first with this function before comparing it
#
# def projectEmbedding(v,mu,U):
#     v = np.array(v)
#     v_tilde = v - mu
#     v_projection = np.zeros(len(v)) # start to build the projection
#     # project the original embedding onto the PCA basis vectors, use only first D dimensions
#     for u in U:
#         v_jj = np.dot(u,v) # scalar
#         v_projection += v_jj*u # vector
#     v_prime = v_tilde - v_projection # final embedding vector
#     v_prime = v_prime/np.linalg.norm(v_prime) # create unit vector
#     return v_prime

OK, just realized why I wasn’t getting +/-1. I forgot to normalize back out to unit vectors. Updated the code where it creates the new embeddings and the transformation in the comments at the bottom.

Results look even better! The orthogonal values look as expected (in left field) and the correlated and non-correlated values make sense too.
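
For anyone following along: the double for-loop in the script could probably be collapsed into a couple of matrix products. This is only a rough sketch under a few assumptions. It runs on random stand-in data rather than real ada-002 embeddings, the helper name `remove_top_components` is made up for illustration, and it swaps sklearn's PCA for numpy's SVD on the centered data (the rows of `Vt` should match the PCA components up to sign). Like the script, it takes the projection coefficients from the original, uncentered vectors.

```python
import numpy as np

def remove_top_components(X, D):
    """Hypothetical vectorized version of the loop in the script.

    X: (N, Dim) array of embeddings, D: number of top components to remove.
    Returns unit-length post-processed embeddings, the mean, and the top-D basis.
    """
    mu = X.mean(axis=0)
    X_tilde = X - mu                                   # center the embeddings
    # rows of Vt are the principal directions of the centered data, strongest first
    _, _, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    U = Vt[:D]                                         # shape (D, Dim)
    # (X @ U.T) holds the coefficients u_j . v for every embedding at once
    X_prime = X_tilde - (X @ U.T) @ U
    # normalize back to unit vectors so dot products behave like cosines
    X_prime = X_prime / np.linalg.norm(X_prime, axis=1, keepdims=True)
    return X_prime, mu, U

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32)) + 3.0                   # stand-in data with a common offset
X_prime, mu, U = remove_top_components(X, D=3)

sims = X_prime @ X_prime.T                             # all pairwise cosine similarities
off_diag = sims[~np.eye(len(X), dtype=bool)]
print(f"cosine range after post-processing: {off_diag.min():.3f} to {off_diag.max():.3f}")
```

The final normalization is the step that was missing at first. Without it the rows are not unit length, so a plain dot product does not behave like a cosine.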

You have to take into account that the “similarity” is some kind of “distance” measure between 2 points in a space of more than 700 dimensions??? LOOOL… So, in fact, trying to “project” the COMPLEXITY of the semantic MAP of such language models onto a single dimension is already far too optimistic. In the best case, it lets us sort several “sentences/words” by “similarity”, as a ranking.

Example for people without much math/geometry background: take a 3-dimensional space, for example people living in a building, where each person is a point in a 3-axis space. Imagine there are only 3 neighbours:

A

B, on the same floor as A, but next door

C, at the “same door” as A, but 10 floors above.

Which neighbour is really closest to A? B.
But if you look at the building from the sky (a bird's-eye view), your calculations will probably say that A & C are very close, and B is further away.

In other words, if we have a space of 3 dimensions and we take a ruler and measure distances in a “2D view” (from the sky), going from 3 to 2 dimensions, then we’re losing information.
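
To put numbers on the building example, here is a tiny sketch. The coordinates are invented for illustration (x is the door position along the corridor, z is the floor), and the units are arbitrary.

```python
import math

# Hypothetical coordinates: (x = door position, y = depth, z = floor)
A = (0.0, 0.0, 0.0)
B = (1.0, 0.0, 0.0)    # same floor as A, next door
C = (0.0, 0.0, 10.0)   # same door as A, ten floors up

def birds_eye(p):
    """Drop the height axis: the 2D view from the sky."""
    return p[:2]

# true distances in 3D
d3_AB = math.dist(A, B)
d3_AC = math.dist(A, C)

# distances after flattening to 2D
d2_AB = math.dist(birds_eye(A), birds_eye(B))
d2_AC = math.dist(birds_eye(A), birds_eye(C))

print(f"3D: A-B = {d3_AB}, A-C = {d3_AC}")
print(f"2D: A-B = {d2_AB}, A-C = {d2_AC}")
```

In 3D, B is the closer neighbour. Seen from above, C lands exactly on top of A, so the ranking flips.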

Returning to embeddings, the similarity function is an effort to REDUCE the more than 700 dimensions of the ada model to a single dimension, place each of the sentences being compared along it, and observe which is “closer” to which.

So, tell me what can go wrong, reducing 700 dimensions to a single one to make “clusters”!? hehehehe.

Yep, the similarity function takes two multidimensional vectors and produces a single real number representing the similarity (or distance) between the two. Math is beautiful, isn’t it.
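
As a concrete illustration of "two vectors in, one real number out", here is a minimal cosine similarity sketch in numpy. The function name and the toy vectors are just for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: a single number in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1, 2, 3], [2, 4, 6])         # same direction
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])   # unrelated directions
opposite = cosine_similarity([1, 2, 3], [-1, -2, -3])  # opposite direction

print(same, orthogonal, opposite)
```

Only the angle between the vectors matters here, not their lengths.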

I saw that today but noticed it uses ada and not davinci. My experience is that none of the models are anywhere near davinci. Is this really giving better results, or just better cost to performance?

Hi, sorry for asking this question in this thread, since the title used is more “general”.

So basically, I have documents in my native language (Bahasa), which I embedded using the API. I queried the documents in my own language and got the top ranks that I wanted. After that, I wondered whether asking the question in English would return similar outputs. It returned some of the same documents as before, but with different cosine similarity scores (the native-language question gives a higher score than the English question). I have my own answer, but I just wanted confirmation here: does the embedding model translate our text into English first, or does it process the text directly into an embedding and return the vector?

Thank you to anyone willing to answer this question!

It embeds in the language your text is in. If you ask in English and have Bahasa text, the vectors will not match as well as English to English or Bahasa to Bahasa.

We did work with a mix of academic journals in English, French, German, Portuguese, Italian, and Spanish.

We asked the question in English. The English embedding always came up first.

We fixed it using GPT (curie model) to translate our English question into six languages. Then we did an embedding run for each language. We combined the final sets and picked the top rows overall. (When languages don’t match, the embedding/dot product score is generally low)

We used dot product for speed. It gives the same result as the cosine method because the vectors are already normalized (unit length) when they come back from GPT.
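
A quick way to convince yourself of the dot product point is to score unit-length vectors both ways. The sketch below uses random stand-in vectors, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(42)

# stand-in "document" vectors, normalized to unit length row by row
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# stand-in "query" vector, also unit length
query = rng.normal(size=8)
query /= np.linalg.norm(query)

dot_scores = docs @ query                              # one matrix-vector product
cos_scores = dot_scores / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

print(np.allclose(dot_scores, cos_scores))
```

For unit vectors the denominator of the cosine formula is 1, so the division can be skipped entirely.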

We ended up marking our embeddings with the language of the source document. That way we didn’t have to run through the entire set of embedding for each language run.
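
Roughly, the flow described above might look like the sketch below. It is only an illustration under assumptions: the translation and embedding calls are left out (the translated query vectors are assumed to exist already), the vectors are random stand-ins, and the names `search_by_language`, `doc_langs`, and `doc_ids` are hypothetical.

```python
import numpy as np

def search_by_language(query_vecs, doc_vecs, doc_langs, doc_ids, top_k=3):
    """Score each translated query only against documents tagged with its language.

    query_vecs: dict mapping language tag -> unit-length query vector
    doc_vecs:   (N, Dim) array of unit-length document vectors
    doc_langs:  list of N language tags, one per document
    doc_ids:    list of N document identifiers
    """
    doc_langs = np.asarray(doc_langs)
    hits = []
    for lang, q in query_vecs.items():
        idx = np.flatnonzero(doc_langs == lang)        # rows tagged with this language
        if idx.size == 0:
            continue
        scores = doc_vecs[idx] @ q                     # dot product on unit vectors
        hits.extend((float(s), doc_ids[i], lang) for s, i in zip(scores, idx))
    hits.sort(key=lambda h: h[0], reverse=True)        # combine all runs, best first
    return hits[:top_k]

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(6, 8))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
doc_langs = ["en", "en", "fr", "fr", "de", "de"]
doc_ids = [f"doc-{i}" for i in range(6)]

# pretend the English and French versions of the question match doc-1 and doc-2 exactly
query_vecs = {"en": doc_vecs[1], "fr": doc_vecs[2]}
results = search_by_language(query_vecs, doc_vecs, doc_langs, doc_ids, top_k=2)
print(results)
```

Filtering on the language tag first is what avoids running every translated query against the whole set.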

It worked very well, BUT we found that languages that use more words to describe a concept tended to score slightly higher (e.g. Portuguese to Portuguese was slightly higher than English to English for the exact same text embedding and question pairs).

If that didn’t make sense, ask me more questions, as we have done extensive testing on this.

Oh wow, I really didn’t expect you to include the details of your use case. Really appreciate it, big thanks man.

So basically, if Portuguese uses something like 200 words and English only 150 words to express the same concept, does that mean Portuguese to Portuguese will return a higher score than English to English?

By marking your embeddings with their main language, does that mean that if a user asks in Spanish, the search will only happen within the embeddings marked as Spanish?

How do you detect the language the user is asking in? Do you use automatic language detection like langdetect, fasttext, or some machine learning model, or do you simply let the user set the language manually when they submit their query?

Would graphics, sounds, speech tone/emotion, video, multimodal input, gestures, facial expressions, place, time, weather, or age matter for some use cases and require additional space?

@curt.kennedy Thank you for sharing your code!
I’ve created a custom module in weaviate that does the ADA+PCA, but my issue is with first training the PCA on meaningful embeddings. How did you achieve this? How many vectors did you feed it (your initial ‘fp’)? Also, were those vectors mostly data you expect to work with?

My ‘fp’ contained 63105 embeddings. All data from incoming SMS messages, so short, largely clustered data, with a variety of excursions. Think of the distribution as a bunch of sparse spikes, with lots of scattered background variation.

Hey @curt.kennedy !
Thanks for sharing the code above. Please help me with the following:-

Let’s say I have my embedding dataset, and I did PCA on top of it according to your code and the paper mentioned above. So, if I have a new user query coming in, what would be the process for finding similarities with the newly developed embedding dataset?

Do I need to add each query to the dataset and then re-apply PCA to find its similarity with the embeddings? Or do I just need to subtract the mean and take the first ‘D’ dimensions?

I assume this might take some time to give a similar response back to the user in production. Please correct me if I am wrong.
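
Going by the commented-out `projectEmbedding` function at the bottom of the script, and the fact that the pickle stores `[E_prime, mu, U]`, one possible reading is that PCA does not need to be re-run per query: the saved mean and top-D basis are applied to the new embedding, and it is then compared against the stored vectors. Here is a minimal sketch of that flow. Random stand-in data replaces the real pickle, numpy's SVD stands in for sklearn's PCA, and `project_embedding` / `top_matches` are hypothetical names.

```python
import numpy as np

def project_embedding(v, mu, U):
    """Apply the stored transform to one new embedding (mirrors projectEmbedding)."""
    v = np.asarray(v, dtype=float)
    v_prime = (v - mu) - U.T @ (U @ v)     # subtract the mean, remove the top-D components
    return v_prime / np.linalg.norm(v_prime)

def top_matches(query_vec, keys, X_prime, mu, U, k=3):
    """Transform the query, then rank the stored embeddings by dot product."""
    q = project_embedding(query_vec, mu, U)
    scores = X_prime @ q                   # cosine similarity, since all rows are unit vectors
    order = np.argsort(scores)[::-1][:k]
    return [(keys[i], float(scores[i])) for i in order]

# stand-in for what the script saves: transformed embeddings, mu, and U
rng = np.random.default_rng(1)
Dim, D, N = 16, 2, 50
X = rng.normal(size=(N, Dim)) + 2.0
mu = X.mean(axis=0)
U = np.linalg.svd(X - mu, full_matrices=False)[2][:D]
keys = [f"hash-{i}" for i in range(N)]
X_prime = np.array([project_embedding(x, mu, U) for x in X])

# a "new" query that happens to equal stored embedding 7
results = top_matches(X[7], keys, X_prime, mu, U)
print(results)
```

Per query this costs a handful of vector operations plus one matrix-vector product, so it should be cheap compared with the PCA fit itself.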