I was also surprised by the loss of quality for typical RAG/similarity-search use cases when reducing the number of dimensions. When going from 3072 down to 256 dimensions (which, according to the blog post, should still have a higher MTEB score than ada-002), the similarity score of an average document in my corpus against a reference document is off by about +/-0.05, which significantly impacts the selection of the top-n results for a query. Even at 1024 dimensions, the error is still close to +/-0.02:
Source
comments := (self systemNavigation allClasses collect: #comment) reject: #isEmptyOrNil.
corpus := SemanticSimpleCorpus new.
corpus embeddingModel: model. "model: the full-size embedding model under test (e.g., text-embedding-3-large with 3072 dimensions), set up beforehand"
corpus addAllDocuments: (comments collect: [:ea | SemanticSimpleDocument forText: ea]).
corpus updateEmbeddings.
someComments := 50 timesCollect: [comments atRandom].
allDeltas := #(3072 3071 3070 3000 2500 2000 1536 1024 512 256)
	collect: [:size |
		compactCorpus := corpus withEmbeddingsCompactedToSize: size. "corpus with all embeddings truncated to the given number of dimensions"
		someDeltas := someComments collect: [:someComment |
			someEmbedding := (corpus documentForObject: someComment) embedding.
			someCompactEmbedding := (compactCorpus documentForObject: someComment) embedding.
			dists := (corpus documents collect: [:ea | ea object asString -> (ea embedding dot: someEmbedding)] as: Dictionary) withKeysSorted values. "similarity of every document to the sample, in a stable key-sorted order"
			compactDists := (compactCorpus documents collect: [:ea | ea object asString -> (ea embedding dot: someCompactEmbedding)] as: Dictionary) withKeysSorted values.
			dists - compactDists]. "per-document differences between full and truncated similarity scores"
		deltas := someDeltas abs average.
		size -> deltas]
	as: Dictionary.
allDeltas collect: [:delta | delta abs median].
So my preliminary conclusion is that, when you can afford it performance-wise, using all the available dimensions is still very much worthwhile. Now I’m thinking about using the truncated embeddings to prefilter documents before reading the full embeddings from disk (as they don’t all fit into main memory), to speed up my vector DB …
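For illustration, here is a minimal sketch of that two-stage idea, reusing only the corpus API from the snippet above. The compact size (256), the candidate factor of 4, topN, and the query embeddings (queryEmbedding, queryCompactEmbedding) are all assumptions for the example, and the lazy loading of full embeddings from disk is not shown:

compactCorpus := corpus withEmbeddingsCompactedToSize: 256. "cheap prefilter index that can stay in memory"
topN := 10. "hypothetical result count"
"Stage 1: rank everything by the truncated embeddings and keep a generous superset of candidates."
candidates := (compactCorpus documents asSortedCollection: [:a :b |
	(a embedding dot: queryCompactEmbedding) > (b embedding dot: queryCompactEmbedding)])
		first: topN * 4.
"Stage 2: re-rank only those candidates with their full embeddings; only these few would need to be read from disk."
results := ((candidates collect: [:doc | corpus documentForObject: doc object])
	asSortedCollection: [:a :b |
		(a embedding dot: queryEmbedding) > (b embedding dot: queryEmbedding)])
			first: topN.

The factor of 4 just trades recall against the number of full embeddings that have to be fetched; whether the 256-dimension prefilter is good enough for that would need the same kind of measurement as above.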