OK, here is the solution. It can basically be solved by post-processing. Apparently this is a problem for trained embeddings right out of the gate. The technical term is that ada-002 isn’t *isotropic*. One big vector is taking over the space and essentially reducing ada-002’s potential (dimensionality). Post-processing can improve this. Now, the paper shows the improvements are slight (2.3%), but it can be done.

Very interesting paper. Subtracting the mean makes sense, and it’s interesting that they talk about dimensionality reduction as well. I will give that a try; it would be nice to get a good publicly available version of that.

All these links are very helpful, thanks for taking the time to go over these things.

Once I get some time, I was going to run PCA on the data, just like “Algorithm 1” in the paper and see what I get. I will report back here.

Reading through that paper, it made me think these embeddings might be encoding how common a word is as well as semantic meaning. I’m only interested in the conceptual meaning for what I’m doing, so I wanted to verify that. And try to subtract that out as well if it’s true.

From my initial tests, it seems that is true. I have 100 sentences made by giving ChatGPT a list of the 50 most common words, and another set made by telling it to avoid them (as well as pronouns etc.). Examples are:

He is a good man.

The flowers in the garden are beautiful.

vs

Ravaged city bears scars of war.

Velociraptors roamed prehistoric savannah.

I made them all isomorphic, and made an image from the sums of their embeddings (more red is more positive, more blue is more negative). At least with this test it is clear the common words (first image) have generally lower values, and the uncommon words have higher values. These images are just normalized 48x32 images made from the 1536 embedding values directly.

This is a first pass, but I think there is a signal there. It makes sense the word frequency is embedded, but the fact that common tends to be low seems a bit surprising.
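For anyone who wants to try the same kind of picture, here is a minimal sketch of how the sum of a batch of embeddings could be turned into a normalized 48x32 grid. The function name `embedding_sum_image`, the row/column orientation (32 rows by 48 columns) and the symmetric scaling to [-1, 1] are my assumptions, not necessarily what was used for the images above, and the random array only stands in for real ada-002 embeddings.

```python
import numpy as np

def embedding_sum_image(embeddings, shape=(32, 48)):
    """Sum a batch of embeddings and reshape the result into a 2-D grid scaled to [-1, 1]."""
    total = np.asarray(embeddings, dtype=float).sum(axis=0)
    if total.size != shape[0] * shape[1]:
        raise ValueError("shape does not match the embedding dimension")
    peak = np.max(np.abs(total))
    if peak > 0:
        # symmetric scaling keeps zero in the middle of a red/blue colormap
        total = total / peak
    return total.reshape(shape)

# Placeholder data: 100 random vectors standing in for 100 sentence embeddings.
rng = np.random.default_rng(0)
placeholder_embeddings = rng.normal(size=(100, 1536))

img = embedding_sum_image(placeholder_embeddings)
print(img.shape, float(img.min()), float(img.max()))
```

The grid can then be shown with something like `matplotlib.pyplot.imshow(img, cmap="bwr", vmin=-1, vmax=1)`, which draws positive values in red and negative values in blue.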

OK, I coded up the algorithm and I’d say I got good results in preliminary testing. I now get cosine similarities that are positive, negative and zero across the embedding search space. The results seem to make sense too!

Only weird thing is that my max and min cosine similarities are ±0.1 instead of ±1. I am only using the top 15 dimensions (D/100 for ada-002). Maybe this is reducing the energy somehow? Anyway, the relative correlations and anti-correlations seem to make sense. Plus my top correlations seem better than the original ones.

Will have to test more to see, but so far the algorithm in the paper seems to work!

Hey @ruby_coder @debreuil

Here is the code I wrote to do this. Hope it helps.

```
import numpy as np
import sklearn.decomposition
import pickle
import time
# Apply 'Algorithm 1' to the ada-002 embeddings to make them isotropic, taken from the paper:
# ALL-BUT-THE-TOP: SIMPLE AND EFFECTIVE POST- PROCESSING FOR WORD REPRESENTATIONS
# Jiaqi Mu, Pramod Viswanath
# This uses Principal Component Analysis (PCA) to 'evenly distribute' the embedding vectors (make them isotropic)
# For more information on PCA, see https://jamesmccaffrey.wordpress.com/2021/07/16/computing-pca-using-numpy-without-scikit/
# the embedding data here is a dict consisting of key / value pairs
# the key is the hash of the message (SHA3-256), the value is the embedding from ada-002 (array of dimension 1536)
# the hash can be used to lookup the original text in a database
# open the pickle containing the embeddings and load the data into memory (the file is closed automatically)
with open('/path/to/your/data/Embedding-Latest.pkl', 'rb') as fp:
    E = pickle.load(fp)
# separate the keys (hashes) and values (embeddings) into separate vectors
K = list(E.keys()) # vector of all the hash values
X = np.array(list(E.values())) # vector of all the embeddings, converted to numpy arrays
# list the total number of embeddings
# this can be truncated if there are too many embeddings to do PCA on
print(f"Total number of embeddings: {len(X)}")
# get dimension of embeddings, used later
Dim = len(X[0])
# flash out the first few embeddings
print("First two embeddings are: ")
print(X[0])
print(f"First embedding length: {len(X[0])}")
print(X[1])
print(f"Second embedding length: {len(X[1])}")
# compute the mean of all the embeddings, and flash the result
mu = np.mean(X, axis=0) # same as mu in paper
print(f"Mean embedding vector: {mu}")
print(f"Mean embedding vector length: {len(mu)}")
# subtract the mean vector from each embedding vector ... vectorized in numpy
X_tilde = X - mu # same as v_tilde(w) in paper
# do the heavy lifting of extracting the principal components
# note that this is a function of the embeddings you currently have here, and this set may grow over time
# therefore the PCA basis vectors may change over time, and your final isotropic embeddings may drift over time
# but the drift should stabilize after you have extracted enough embedding data to characterize the nature of the embedding engine
print("Performing PCA on the mean-centered embeddings ...")
pca = sklearn.decomposition.PCA() # new object
TICK = time.time() # start timer
pca.fit(X_tilde) # do the heavy lifting!
TOCK = time.time() # end timer
DELTA = TOCK - TICK
print(f"PCA finished in {DELTA} seconds ...")
# dimensional reduction stage (the only hyperparameter)
# pick max dimension of PCA components to express embeddings
# in general this is some integer less than or equal to the dimension of your embeddings
# it could be set as a high percentile, say 95th percentile of pca.explained_variance_ratio_
# but just hardcoding a constant here
D = 15 # hyperparameter on dimension (out of 1536 for ada-002), paper recommends D = Dim/100
# form the set of v_prime(w), which is the final embedding
# this could be vectorized in numpy to speed it up, but coding it directly here in a double for-loop to avoid errors and to be transparent
E_prime = dict() # output dict of the new embeddings
N = len(X_tilde)
N10 = max(1, round(N/10)) # progress interval; at least 1 so the modulo below never divides by zero
U = pca.components_ # set of PCA basis vectors, sorted by most significant to least significant
print(f"Shape of full set of PCA components {U.shape}")
U = U[0:D,:] # take the top D dimensions (or take them all if D is the size of the embedding vector)
# U = U[D:,:] # take All But The Top!
print(f"Shape of downselected PCA components {U.shape}")
for ii in range(N):
    v_tilde = X_tilde[ii]
    v = X[ii]
    v_projection = np.zeros(Dim) # start to build the projection
    # project the original embedding onto the PCA basis vectors, use only first D dimensions
    for jj in range(D):
        u_jj = U[jj,:] # vector
        v_jj = np.dot(u_jj,v) # scalar
        v_projection += v_jj*u_jj # vector
    v_prime = v_tilde - v_projection # final embedding vector
    v_prime = v_prime/np.linalg.norm(v_prime) # create unit vector
    E_prime[K[ii]] = v_prime
    if (ii%N10 == 0) or (ii == N-1):
        print(f"Finished with {ii+1} embeddings out of {N} ({round(100*ii/N)}% done)")
# save as new pickle
print("Saving new pickle ...")
embeddingName = '/path/to/your/data/Embedding-Latest-Isotropic.pkl'
with open(embeddingName, 'wb') as f: # Python 3: open(..., 'wb')
    pickle.dump([E_prime,mu,U], f)
print(embeddingName)
print("Done!")
# When working with live data with a new embedding from ada-002, be sure to transform it first with this function before comparing it
#
# def projectEmbedding(v,mu,U):
#     v = np.array(v)
#     v_tilde = v - mu
#     v_projection = np.zeros(len(v)) # start to build the projection
#     # project the original embedding onto the PCA basis vectors, use only first D dimensions
#     for u in U:
#         v_jj = np.dot(u,v) # scalar
#         v_projection += v_jj*u # vector
#     v_prime = v_tilde - v_projection # final embedding vector
#     v_prime = v_prime/np.linalg.norm(v_prime) # create unit vector
#     return v_prime
```
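The comments in the code mention that the double for-loop could be vectorized. Here is a rough sketch of what that might look like. To keep it free of extra dependencies I swapped `sklearn.decomposition.PCA` for a plain `np.linalg.svd` on the mean-centered data, which should give the same principal directions up to sign (the sign does not affect the result here). The names `all_but_the_top` and `project_embedding` are mine, and the random matrix is only placeholder data, not real embeddings.

```python
import numpy as np

def all_but_the_top(X, D):
    """Vectorized sketch of the loop above; returns (X_prime, mu, U)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    X_tilde = X - mu
    # Rows of Vt are the principal directions of the centered data,
    # sorted from largest to smallest singular value.
    _, _, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    U = Vt[:D]
    # Like the posted loop, the coefficients are computed from the
    # original (uncentered) embeddings, then removed from the centered ones.
    X_prime = X_tilde - (X @ U.T) @ U
    X_prime /= np.linalg.norm(X_prime, axis=1, keepdims=True)  # unit vectors
    return X_prime, mu, U

def project_embedding(v, mu, U):
    """Apply the same transformation to a single new embedding."""
    v = np.asarray(v, dtype=float)
    v_prime = (v - mu) - U.T @ (U @ v)
    return v_prime / np.linalg.norm(v_prime)

# Placeholder data with a large shared offset, standing in for real embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64)) + 3.0
X_prime, mu, U = all_but_the_top(X, D=2)
print(X_prime.shape, U.shape)
```

Running `project_embedding(X[0], mu, U)` should reproduce `X_prime[0]`, which is a quick way to check that new embeddings are being transformed the same way as the stored ones.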

OK, just realized why I wasn’t getting +/-1. I forgot to normalize back out to unit vectors. Updated the code where it creates the new embeddings and the transformation in the comments at the bottom.
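To illustrate why the missing normalization shrinks the range, here is a small sketch on synthetic data (not real ada-002 embeddings). The vectors are built to share one dominant direction, which is my stand-in for the "one big vector" effect. Once the mean is subtracted, what remains of each vector is much shorter than 1, so raw dot products cannot get anywhere near ±1 until the vectors are rescaled.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic unit vectors that all share one dominant direction plus small noise.
X = 1.0 + 0.1 * rng.normal(size=(500, 32))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Subtract the mean: the shared direction is mostly removed, leaving short vectors.
X_tilde = X - X.mean(axis=0)
norms = np.linalg.norm(X_tilde, axis=1)

raw = X_tilde @ X_tilde.T            # dot products without renormalizing
X_unit = X_tilde / norms[:, None]    # rescale back to unit length
cos = X_unit @ X_unit.T              # true cosine similarities

print("largest norm after centering:", float(norms.max()))
print("largest raw dot product:", float(np.abs(raw).max()))
print("self-similarity after renormalizing:", float(cos[0, 0]))
```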

Results look even better! The orthogonal values look as expected (in left field) and the correlated and non-correlated values make sense too.

This looks awesome, will try it out — thank you!

Thanks @curt.kennedy and for the @ mention as well.

Will definitely look at your code and probably port it to Ruby and add this method to my OpenAI test harness.

Thanks again for sharing!

It looks like I have something new to digest. Thanks for all the experiments. I figured I would wait until you came to some conclusions.

You must take into account that the “similarity” is some kind of “distance” measure between 2 points in a space of more than 700 dimensions ??? LOOOL… So, in fact, trying to “project” the COMPLEXITY of the semantic MAP of such language models onto a single dimension is already an overly optimistic feat. In the best case, it lets us sort several “sentences/words” by “similarity”, as a ranking.

Example for people without much math/geometry knowledge: take a 3-dimensional space, for example people living in a building, where each person is a point in a 3-axis space. Imagine there are only 3 neighbours:

- A
- B, on the same level as A, but at the next door
- C, at the “same door” as A, but 10 levels above.

Which is really the neighbour closest to A? B.

But if you look at the building from the sky (a bird’s-eye view), your calculations will probably say that A & C are very close, and B is further away.

In other words, if we have a space of 3 dimensions and we take a ruler and measure distances in a “2D view” (from the sky), going from 3 to 2 dimensions, then we’re losing information.
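The building example can be put into numbers with a tiny sketch. The coordinates are my own illustration (x is the position along the corridor, y is depth, z is the floor, all in arbitrary units), so only the comparison matters, not the values themselves.

```python
import numpy as np

A = np.array([0.0, 0.0, 0.0])
B = np.array([1.0, 0.0, 0.0])    # next door to A, same level
C = np.array([0.0, 0.0, 10.0])   # same door as A, 10 levels above

def dist(p, q):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(p - q))

# Full 3-D distances: B is the closest neighbour.
d3 = {"AB": dist(A, B), "AC": dist(A, C)}

# View from the sky: drop the z axis and measure again.
d2 = {"AB": dist(A[:2], B[:2]), "AC": dist(A[:2], C[:2])}

print(d3)  # {'AB': 1.0, 'AC': 10.0}
print(d2)  # {'AB': 1.0, 'AC': 0.0}
```

In 3-D, B is closest to A; seen from above, C sits exactly on top of A and looks closest, because the dropped axis carried the information that separated them.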

Returning to embeddings, the similarity function is an effort to REDUCE the more than 700 dimensions of the ada model to a single dimension, place each of the sentences to be compared on it, and observe which is “closer” to which.

So, tell me what can go wrong, reducing 700 dimensions to a single one to make “clusters”!? hehehehe.

Yep, the similarity function takes two multidimensional vectors and produces a single real number representing the similarity (or distance) between the two. Math is beautiful, isn’t it.

I saw that today but noticed it uses ada and not davinci. My experience is that none of the models are anywhere near davinci. Is this really giving better results, or just better cost to performance?

For embeddings, ada-002 is supposed to be better than Davinci.

Hi, sorry if I’m asking this question in the wrong thread, since the title used is more “general”.

So basically, I have a document in my native language (Bahasa), which I embedded using the API. I queried the documents in my own language and got the top ranks that I wanted. After that, I wondered whether asking the question in English would return similar outputs or not. It did return some of the same documents as before, but with different cosine similarity scores (a native-language question gives a higher score than an English question). I have my own answer but I just wanted confirmation here: does the embedding model translate our text into English first, or does it just embed the text directly and return the vector?

Thank you to anyone willing to answer this question!

It embeds in the language your text is in. If you ask in English and have Bahasa text, the vectors will not match as well as English to English or Bahasa to Bahasa.

We did work with a mix of academic journals in English, French, German, Portuguese, Italian, and Spanish.

We asked the question in English. The English embedding always came up first.

We fixed it using GPT (curie model) to translate our English question into six languages. Then we did an embedding run for each language. We combined the final sets and picked the top rows overall. (When languages don’t match, the embedding/dot product score is generally low)

We used dot product for speed. It gives the same result as the cosine method because the embedding vectors come back already normalized to unit length.
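A quick way to convince yourself of the dot product point is the sketch below. It uses random unit vectors as placeholders rather than real embeddings, and `cosine_similarity` is just a helper written for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)

# Placeholder document vectors and query, each scaled to unit length.
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=8)
query /= np.linalg.norm(query)

dot_scores = docs @ query  # one matrix-vector product for all documents
cos_scores = np.array([cosine_similarity(d, query) for d in docs])

print(np.allclose(dot_scores, cos_scores))  # True
```

With unit-length vectors the denominator in the cosine formula is 1, so the two scores agree and the cheaper dot product gives the same ranking.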

We ended up marking our embeddings with the language of the source document. That way we didn’t have to run through the entire set of embedding for each language run.

It worked very well BUT we found that languages that use more words to describe a concept tended to score slightly higher. (E.g., Portuguese to Portuguese was slightly higher than English to English for the exact same text embedding and question pairs.)

If that didn’t make sense, ask me more questions, as we have done extensive testing on this.

Oh wow, I really didn’t expect you to include the details of your use case. Really appreciate it, big thanks man.

So basically, if Portuguese uses something like 200 words and English uses only about 150 words to express the same concept, does that mean Portuguese to Portuguese will return a higher score than English to English?

Yes but it is a very small difference and didn’t cause issues in the end.

so sorry if I’m asking more questions

By marking your embeddings with their main language, does that mean that if a user asks in Spanish, the search will only happen within the embeddings marked as Spanish?

How do you detect the language the user is asking in? Do you use automatic language detection like `langdetect`, `fasttext`, some machine learning models, or do you simply let the user set the parameter manually when they send their query?

Sorry in advance if this is too much for you.

I let the user pick the language when they uploaded the document

I embedded in the source language

When a question was asked I translated it into the languages of the documents I knew I had available to search

Then I searched all the English ones with the English translation

Then I searched all the Spanish ones with the Spanish translation

Then I combined the two lists of top dot products into one list and took the top x rows

Finally I asked the question in English but I supplied the contexts to the prompt in their original language

GPT was fine with mixing languages in the prompt. It was just the semantic search that was modified
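Here is a rough sketch of the search step described above: each translated question is only compared against the embeddings tagged with its own language, then the per-language hits are merged and the top rows are kept. The translation and embedding calls are left out, so the query vectors are assumed to be computed already. The function name `search_by_language`, its arguments and the tiny 2-D vectors are all made up for illustration.

```python
import numpy as np

def search_by_language(doc_vecs, doc_langs, query_vecs, top_k=3):
    """Search each language subset with its own translated query and merge the hits.

    doc_vecs:   (N, d) array of unit-length document embeddings
    doc_langs:  list of N language tags, one per document
    query_vecs: dict mapping language tag -> unit-length embedding of the translated question
    Returns a list of (doc_index, score), best score first.
    """
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    doc_langs = np.asarray(doc_langs)
    hits = []
    for lang, q in query_vecs.items():
        idx = np.flatnonzero(doc_langs == lang)  # documents tagged with this language
        if idx.size == 0:
            continue
        scores = doc_vecs[idx] @ np.asarray(q, dtype=float)  # dot product on unit vectors
        hits.extend(zip(idx.tolist(), scores.tolist()))
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:top_k]

# Toy 2-D "embeddings": two English and two Spanish documents.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
doc_langs = ["en", "en", "es", "es"]
query_vecs = {"en": [1.0, 0.0], "es": [0.0, 1.0]}

results = search_by_language(doc_vecs, doc_langs, query_vecs, top_k=2)
print(results)
```

Tagging each embedding with its language means every translated query only touches its own subset, instead of scanning the whole set once per language.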