I was also surprised by the loss of quality for typical RAG/similarity-search use cases when reducing the number of dimensions. When going from 3072 down to 256 dimensions (which, according to the blog post, should still have a higher MTEB score than ada-002), the similarity score of an average document in my corpus against a reference document is off by about +/-0.05, which significantly impacts the selection of the top-n results for a query. Even at 1024 dimensions, the error is still close to +/-0.02:
Source
comments := (self systemNavigation allClasses collect: #comment) reject: #isEmptyOrNil.
corpus := SemanticSimpleCorpus new.
corpus embeddingModel: model. "model: the full-size embedding model under test (e.g., text-embedding-3-large with 3072 dimensions), set up beforehand"
corpus addAllDocuments: (comments collect: [:ea | SemanticSimpleDocument forText: ea]).
corpus updateEmbeddings.
someComments := 50 timesCollect: [comments atRandom].
allDeltas := #(3072 3071 3070 3000 2500 2000 1536 1024 512 256)
	collect: [:size |
		compactCorpus := corpus withEmbeddingsCompactedToSize: size. "corpus with all embeddings truncated to the given number of dimensions"
		someDeltas := someComments collect: [:someComment |
			someEmbedding := (corpus documentForObject: someComment) embedding.
			someCompactEmbedding := (compactCorpus documentForObject: someComment) embedding.
			dists := (corpus documents collect: [:ea | ea object asString -> (ea embedding dot: someEmbedding)] as: Dictionary) withKeysSorted values. "similarity of every document to the sample, in a stable key-sorted order"
			compactDists := (compactCorpus documents collect: [:ea | ea object asString -> (ea embedding dot: someCompactEmbedding)] as: Dictionary) withKeysSorted values.
			dists - compactDists]. "per-document differences between full and truncated similarity scores"
		deltas := someDeltas abs average.
		size -> deltas]
	as: Dictionary.
allDeltas collect: [:delta | delta abs median].
So my preliminary conclusion is that, when you can afford it performance-wise, using all the available dimensions is still very much worthwhile. Now I’m thinking about using the truncated embeddings to prefilter documents before reading the full embeddings from disk (as they don’t all fit into main memory), to speed up my vector DB …
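For illustration, here is a minimal sketch of that two-stage idea, reusing only the corpus API from the snippet above. The compact size (256), the candidate factor of 4, topN, and the query embeddings (queryEmbedding, queryCompactEmbedding) are all assumptions for the example, and the lazy loading of full embeddings from disk is not shown:

compactCorpus := corpus withEmbeddingsCompactedToSize: 256. "cheap prefilter index that can stay in memory"
topN := 10. "hypothetical result count"
"Stage 1: rank everything by the truncated embeddings and keep a generous superset of candidates."
candidates := (compactCorpus documents asSortedCollection: [:a :b |
	(a embedding dot: queryCompactEmbedding) > (b embedding dot: queryCompactEmbedding)])
		first: topN * 4.
"Stage 2: re-rank only those candidates with their full embeddings; only these few would need to be read from disk."
results := ((candidates collect: [:doc | corpus documentForObject: doc object])
	asSortedCollection: [:a :b |
		(a embedding dot: queryEmbedding) > (b embedding dot: queryEmbedding)])
			first: topN.

The factor of 4 just trades recall against the number of full embeddings that have to be fetched; whether the 256-dimension prefilter is good enough for that would need the same kind of measurement as above.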