Some questions about text-embedding-ada-002’s embedding

If you are fitting 95 vectors, you follow ‘Algorithm 1’ in the paper, and you get 95 vectors with 1536 dimensions each. Here is ‘Algorithm 1’:

[Algorithm 1 from the paper]

Because the paper doesn’t do this!

The paper is performing PCA on the mean-centered collection of embedding vectors; this is the X_tilde variable in the code, defined as X - mu, where X is the set of embedding vectors you want to create a fit for, and mu is the average of those vectors.
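In NumPy terms, that centering step is just this (a sketch; X here is assumed to be an (n_samples, 1536) array with one raw ada-002 embedding per row):

```python
import numpy as np

# X: one ada-002 embedding per row, shape (n_samples, 1536)
mu = X.mean(axis=0)   # average embedding vector, shape (1536,)
X_tilde = X - mu      # mean-centered embeddings, same shape as X
```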

Then you downselect (with D) how many PCA vectors you want to use to represent your set of embeddings. So the code you quoted is essentially dropping unimportant information by re-expressing each embedding vector using the top 15 PCA vectors, which, per the principles of PCA, are the 15 directions that capture the most variance in the data set. The remaining 1521 components that PCA computed are discarded, since the variance they express is small. But you still get a full 1536-dimension vector when you use the top 15 PCA basis vectors to represent each embedding. (See ‘Algorithm 1’.)
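Concretely, the downselect-and-reconstruct step might look like this (a sketch continuing from the snippet above; I’m assuming the PCA basis comes from an SVD of X_tilde, which is one standard way to compute it):

```python
# PCA via SVD of the centered data; the rows of Vt are the principal
# axes, already sorted by decreasing explained variance.
_, _, Vt = np.linalg.svd(X_tilde, full_matrices=False)

D = 15
U = Vt[:D]               # top-D PCA basis vectors, shape (15, 1536)

# Re-express each centered embedding in the top-D basis: project onto U,
# then map back into the original space. The result is still 1536-dim.
coords = X_tilde @ U.T   # shape (n_samples, 15)
X_reduced = coords @ U   # shape (n_samples, 1536)
```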

Also, 95 samples is probably on the low side. Here I fit and transformed 63105 different embedding vectors. So, to be clear, I put in 63105 vectors, each of length 1536, and got out 63105 vectors of length 1536. I put all of these in a dictionary with 63105 key/value pairs, where the value is the new embedding vector with 1536 dimensions and the key is a hash of the underlying text that was embedded. This hash is used to index the original text in another database, not shown in the code.
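Sketched in code, that bookkeeping might look roughly like this (the hashing scheme and variable names here are my assumptions, not the exact code; texts holds the 63105 source strings):

```python
import hashlib

fitted = {}
for text, vec in zip(texts, X_reduced):
    # Key: hash of the embedded text, also used to index the original
    # text in the external database (not shown).
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    fitted[key] = vec    # value: the new 1536-dim embedding
```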

```
Total number of embeddings: 63105
First two embeddings are:
[ 0.00418844 -0.0094733  -0.01019533 ... -0.0273703  -0.03086011
 -0.03326688]
First embedding length: 1536
[ 0.00499764  0.01619455  0.01353865 ... -0.00447941 -0.01553382
 -0.01027383]
Second embedding length: 1536
Mean embedding vector: [-0.0078587  -0.00596242 -0.0014923  ... -0.01967299 -0.01243456
 -0.02973328]
Mean embedding vector length: 1536
Performing PCA on the normalized embeddings ...
PCA finished in 8.220144271850586 seconds ...
Shape of full set of PCA components (1536, 1536)
Shape of downselected PCA components (15, 1536)
Finished with 1 embeddings out of 63105 (0% done)
Finished with 6311 embeddings out of 63105 (10% done)
Finished with 12621 embeddings out of 63105 (20% done)
Finished with 18931 embeddings out of 63105 (30% done)
Finished with 25241 embeddings out of 63105 (40% done)
Finished with 31551 embeddings out of 63105 (50% done)
Finished with 37861 embeddings out of 63105 (60% done)
Finished with 44171 embeddings out of 63105 (70% done)
Finished with 50481 embeddings out of 63105 (80% done)
Finished with 56791 embeddings out of 63105 (90% done)
Finished with 63101 embeddings out of 63105 (100% done)
Finished with 63105 embeddings out of 63105 (100% done)
```

Also, don’t forget that when a new embedding vector comes in, after you have all this fitted data, you have to transform it with a call to projectEmbedding(v, mu, U), where v is the original vector from ada-002, mu is the saved vector average from the original fit, and U is your collection of top PCA basis vectors (15 is the default). This is mentioned above.
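A minimal version of that transform, under the same assumptions as the sketches above (U holds the top-D basis vectors as rows), could be:

```python
def projectEmbedding(v, mu, U):
    """Transform a new ada-002 vector using the saved fit.

    v:  raw 1536-dim embedding from ada-002
    mu: mean vector saved from the original fit
    U:  top-D PCA basis vectors, shape (D, 1536)
    """
    v_tilde = v - mu             # center with the *fitted* mean
    return (v_tilde @ U.T) @ U   # back to a full 1536-dim vector
```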
