It looks like 'text-embedding-3' embeddings are truncated/scaled versions of the higher-dim version

I was hacking around with the new embedding models and hypothesized they were all inherited from the larger dimensional version.

This looks to be true.

import os
import requests
import numpy as np

YOUR_OPENAI_KEY = os.environ["OPENAI_API_KEY"]  # assumes your key is exported in the OPENAI_API_KEY environment variable

Msg0 = "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe."

Payload0 = {
  "model": "text-embedding-3-large",
  "input": Msg0,
  "dimensions": 1024
}

Payload1 = {
  "model": "text-embedding-3-large",
  "input": Msg0,
  "dimensions": 3072
}

HEADERS = {"Authorization": f"Bearer {YOUR_OPENAI_KEY}",
            "Content-Type": "application/json"}

r = requests.post("https://api.openai.com/v1/embeddings", json=Payload0, headers=HEADERS)
q = r.json()
Embedding0 = q['data'][0]['embedding']

r = requests.post("https://api.openai.com/v1/embeddings", json=Payload1, headers=HEADERS)
q = r.json()
Embedding1 = q['data'][0]['embedding']

v0 = np.array(Embedding0)
v1 = np.array(Embedding1)
v1 = v1[0:1024] # truncate the large model at higher dims to match the low dim model
v1 = v1/np.linalg.norm(v1) # re-scale back out to unit hypersphere.
print('Magnitude of the Vector v0:', np.linalg.norm(v0))
print('Magnitude of the Vector v1:', np.linalg.norm(v1))

c = np.dot(v0,v1)

print(f"Dot Product: {c}")

# Example output showing similarity
# Magnitude of the Vector v0: 1.0000000393255921
# Magnitude of the Vector v1: 1.0
# Dot Product: 0.9999992894122343

So if you want to create an arbitrary-dimensional version, within either large or small, you just take the embedding at a higher dimension and then truncate/scale the vector to produce the desired one.

So if you want a 512-dimensional embedding from 3-large, you just request either the 1024 or 3072 version, then truncate and re-scale to a unit vector.
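
Here's a minimal sketch of that truncate-and-rescale step (the helper name and the 512 target are just illustrative; Embedding1 is the 3072-dim response from the code above):

import numpy as np

def shorten_embedding(vec, dims):
    """Truncate a higher-dim embedding and re-scale back out to the unit hypersphere."""
    v = np.asarray(vec, dtype=np.float64)[:dims]  # keep the first `dims` components
    return v / np.linalg.norm(v)                  # re-normalize to unit length

v512 = shorten_embedding(Embedding1, 512)  # synthesize a 512-dim vector offline
print(np.linalg.norm(v512))                # ~1.0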

I confirmed you cannot mix large/small though, as evidenced here:

Magnitude of the Vector v0: 0.9999999908481855
Magnitude of the Vector v1: 0.9999999999999998
Dot Product: 0.010467056078529848

So this trick can only be applied within the model (large or small), not mixed between the two.

TL;DR You can save API costs by embedding once at the higher dimension, for a given model, and synthesizing the other dimensions offline, including dimensions not offered by the API.

PS. Your MTEB scores will just gradually diminish as you truncate/scale, as expected, per dim dropped.

16 Likes

It’s a cool idea, using a mask to pre-compute distances at smaller dimensions. If you’re CPU-bound, you could just abort the cosine similarity operation if you’re not approaching your cutoff after N dims :thinking:
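
If you want that abort to be safe rather than heuristic, you can bound what the remaining dims could still contribute. A sketch, assuming unit-length vectors (the function and parameter names are made up):

import numpy as np

def dot_with_early_abort(query, doc, cutoff, n_dims=256):
    """Partial dot product over the first n_dims; skip the document if even the
    best case (Cauchy-Schwarz bound on the tail) cannot reach the cutoff."""
    partial = np.dot(query[:n_dims], doc[:n_dims])
    q_tail = np.sqrt(max(0.0, 1.0 - np.dot(query[:n_dims], query[:n_dims])))
    d_tail = np.sqrt(max(0.0, 1.0 - np.dot(doc[:n_dims], doc[:n_dims])))
    if partial + q_tail * d_tail < cutoff:
        return None            # cannot possibly reach the cutoff: abort
    return np.dot(query, doc)  # otherwise compute the full dot product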

2 Likes

Yes, you could mask, but you have to re-scale the un-masked sub-vector back to unit length before you can use cosine similarity (dot product). This is only 1 additional number in the memory structure, per vector, after you mask, so not a big deal. :rofl:

It’s just that the new models are all inherited from the same large “mother model” and can be easily derived from this big model.

I wasn’t sure how this affects MTEB, but the publication suggests a gradual degradation as the dimensions are dropped, which implies something similar for arbitrary dimensions.

What this suggests is that OpenAI spread the information out in such a way that the beginning of the vector contains the most information, and it either gradually decreases as you go to higher dimensions, or maybe it’s uniform across, I can’t tell. Maybe it is tapered to give good low-dim performance.

But the information is not just concentrated at the far end of the large vector, which is why they can simply truncate.

You might be tempted to try this with other embedding models, and it might work there too. But if the information is concentrated somewhere unexpected in the vector, rather than engineered for truncation the way these OpenAI vectors seem to be, you could get weird results.

From a DevOps perspective: just embed everything at 3072 (or whatever your highest dim is), and when you create your in-memory shards, truncate/rescale to create a system with that dim.
And for incoming queries, do the same: truncate/rescale to match your set of RAG vectors at your chosen arbitrary dimension.

This will allow you to tune your latencies, and trade quality, at an arbitrary 1-dim-at-a-time granularity, which is pretty insane!
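
A rough sketch of that setup, assuming you keep an (N, 3072) matrix of full unit-length embeddings around (names are illustrative):

import numpy as np

def build_shard(full_matrix, dims):
    """Truncate every row of the full-dim corpus matrix to `dims` and re-normalize."""
    shard = full_matrix[:, :dims].copy()
    shard /= np.linalg.norm(shard, axis=1, keepdims=True)
    return shard

def search(shard, full_query, top_k=5):
    """Truncate/rescale the incoming full-dim query the same way, then rank by dot product."""
    q = full_query[:shard.shape[1]]
    q = q / np.linalg.norm(q)
    return np.argsort(shard @ q)[::-1][:top_k]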

2 Likes

So…

How long until one of us tries to figure out the cone angle of the embedding space?

3 Likes

This reminds me of another thing we have to check: whether or not these new models have highly correlated embeddings.

This will influence your thresholds for “non-correlation”, as they used to sit just under 0.9 for ada-002.

But now the spaces might be more isotropic and the cones are bigger. We’ll see!
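
(One way to check, sketched below: embed a batch of unrelated texts and look at the spread of their pairwise similarities; the helper is hypothetical.)

import numpy as np

def cone_stats(embeddings):
    """Pairwise cosine similarities of embeddings of *unrelated* texts; the spread
    tells you where to put your 'non-correlated' threshold for this model."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E.T
    off_diag = sims[~np.eye(len(E), dtype=bool)]
    return off_diag.mean(), np.percentile(off_diag, [5, 50, 95])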

2 Likes

Reading this, my first thought is to not do truncation; rather, if you have a large enough set of embedded vectors at the highest dimension, which all come from the same (or similar) knowledge domain, you could potentially employ some standard (but more sophisticated than truncation) dimension-reduction techniques.

This may have the benefit of keeping relatively high performance at lower dimensions for embeddings within that knowledge domain.

And, since you’re keeping all of the raw, full-dimension vectors, you’d be able to periodically recompute the vector reductions offline as you gather more embeddings…
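
For example, something like this with scikit-learn (just a sketch; PCA stands in for whatever reduction you choose, and you’d refit it offline as the corpus grows):

import numpy as np
from sklearn.decomposition import PCA

def reduce_corpus(full_matrix, dims):
    """Fit PCA on the full-dimension corpus and project everything down to `dims`."""
    pca = PCA(n_components=dims)
    reduced = pca.fit_transform(full_matrix)
    reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)  # so dot product still behaves like cosine
    return pca, reduced

def reduce_query(pca, full_query):
    """Incoming queries must go through the same fitted projection."""
    q = pca.transform(full_query.reshape(1, -1))[0]
    return q / np.linalg.norm(q)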

2 Likes

I agree that truncation seems rather taboo, personally.

But OpenAI has engineered these new embedding models such that truncation works, and it’s what they are currently doing to carve out the new lower dimensions. Probably to save compute costs.

Plus, truncation is easy. :sweat_smile:

But doing something fancier, like PCA, could work too; it’s just more work. But it could be worth it.

PCA was my solution to the “little cone” problem of ada-002. But I didn’t operationalize it, since tightening my correlation bounds was easy and worked just fine.

But dimension reduction should be on the table. I personally just have MTEB insecurities about anything other than truncation, as these models appear to be MTEB-robust against truncation.

1 Like

Started monkeying with a bit of my “playground” and 3-large…

== Sample from 3 d=1024 vectors returned ==
0 [‘-0.048355922’, ‘0.030448081’, ‘-0.002223636’, ‘-0.012373986’]
1 [‘0.000544521’, ‘-0.052708060’, ‘0.037970472’, ‘0.024555754’]
2 [‘-0.018499656’, ‘-0.023828303’, ‘0.041195750’, ‘-0.011031237’]

== Sample from 3 d=3072 vectors returned ==
0 [‘-0.035144586’, ‘0.022129351’, ‘-0.001616116’, ‘-0.008993286’]
1 [‘0.000397846’, ‘-0.038510315’, ‘0.027742529’, ‘0.017941277’]
2 [‘-0.013463791’, ‘-0.017341908’, ‘0.029981693’, ‘-0.008028381’]

== Cosine similarity and vector comparison of all inputs @ 3072 ==
0:“Jet pack” <==> 1:“tonal language”:
0.0293138009386300 - identical: False
0:“Jet pack” <==> 2:“It’s greased lightning!”:
0.2810681077365715 - identical: False
1:“tonal language” <==> 2:“It’s greased lightning!”:
0.0498102101138038 - identical: False

So yes, the 1024 are extracted right from the initial 3072. We can see the same signs and proportionally scaled magnitudes.

I wonder if they did any remapping of the highest-relevance dimensions, as a lot of the distinguishing information was found to sit in the middle third of other embedding models.


A very different range of cosine similarity values than the ~0.8 thresholds of ada-002:

== Cosine similarity and vector comparison of all inputs @ 1536 ==
0:“Jet pack” <==> 1:“Jet pack”:
1.0000000000000002 - identical: True
0:“Jet pack” <==> 2:“tonal language”:
0.7175784894183529 - identical: False
0:“Jet pack” <==> 3:“It’s greased lightning!”:
0.7802673329570928 - identical: False
1:“Jet pack” <==> 2:“tonal language”:
0.7175784894183529 - identical: False
1:“Jet pack” <==> 3:“It’s greased lightning!”:
0.7802673329570928 - identical: False
2:“tonal language” <==> 3:“It’s greased lightning!”:
0.7006591403724325 - identical: False


PS: API base64 → 32-bit float (single), then math done in double. Displayed values are vector positions 250-254, rounded from ~16 digits.


def cosine_similarity(asingle, bsingle) -> np.double:
    """return normalized dot product of two arrays"""
    a = asingle.astype(np.double)
    b = bsingle.astype(np.double)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
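
(And a sketch of the base64 → float32 decode step mentioned above, assuming the request was made with "encoding_format": "base64"; the helper name is made up.)

import base64
import numpy as np

def decode_embedding(b64_string):
    """Decode a base64 embedding payload into 32-bit floats, then upcast to double for the math."""
    singles = np.frombuffer(base64.b64decode(b64_string), dtype=np.float32)
    return singles.astype(np.double)
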
4 Likes

I think there was some re-mapping, which is why they are truncating to produce the different dimensions. It’s not naive truncation, it’s smart truncation, which is why I’m thinking these models are MTEB-robust against truncation. It’s engineered that way.

How would they do this? It’s hard to say, but once they figure out the sensitive dimensions in the model, they keep those locked for the lowest dimension available in the API. Then they form another set, for the next higher dimension, and so on.

This could be achieved by simple permutations of the layer producing the embeddings, plus whatever sensitivity analysis went into the dimensional slicing.

If I were to speculate, they just went MTEB testing crazy, and threw a dart at the fuzzy dart board :rofl:

It would be cool if there was some systematic way to perfectly order your embedding dimensions, from most important to least important… Maybe it’s a PCA analysis, yeah, that’s probably it. Just do that and get your ordering, and partition from there?

4 Likes

Ah, as described in the seminal paper, “Training on Validation Data Is All You Need”

1 Like

Link or it doesn’t exist.

I couldn’t find an exact match to such a paper with my 20 seconds of searching.

2 Likes

Sorry, that was a joke. :sweat_smile:

If you use the validation/evaluation method to tweak your model, you’re gonna get the best results in validation/evaluation. But the paper is real-ish:

1 Like

OK good. Whew. :sweat_smile:

My first comeback was going to be that while the model is awesome at benchmarks, there is no way it generalizes and is actually useful to anyone.

Good to see joke papers out there floating around :rofl:

2 Likes

I was surprised, too, about the loss of quality for typical RAG/similarity-search use cases when reducing the number of dimensions. When going from 3072 dimensions down to 256 (which, according to the blog post, should still have a higher MTEB score than ada-002), the similarity score for an average document in my corpus against a reference document has an error of +/-0.05, which significantly impacts the selection of the top-n results for a query. Even for 1024 dimensions, the error is close to +/-0.02:

Source
comments := (self systemNavigation allClasses collect: #comment) reject: #isEmptyOrNil.

corpus := SemanticSimpleCorpus new.
corpus embeddingModel: model.
corpus addAllDocuments: (comments collect: [:ea | SemanticSimpleDocument forText: ea]).
corpus updateEmbeddings.

someComments := 50 timesCollect: [comments atRandom].

allDeltas := #(3072 3071 3070 3000 2500 2000 1536 1024 512 256)
	collect: [:size |
		compactCorpus := corpus withEmbeddingsCompactedToSize: size.
		someDeltas := someComments collect: [:someComment |
			someEmbedding := corpus documentForObject: someComment.
			someCompactEmbedding := compactCorpus documentForObject: someComment.

			dists := (corpus documents collect: [:ea | ea object asString -> (ea embedding dot: someEmbedding)] as: Dictionary) withKeysSorted values.
			compactDists := (compactCorpus documents collect: [:ea | ea object asString -> (ea embedding dot: someCompactEmbedding)] as: Dictionary) withKeysSorted values.

			dists - compactDists].
		deltas := someDeltas abs average.
		size -> deltas]
	as: Dictionary.

allDeltas collect: [:delta | delta abs median].

So my preliminary conclusion is that when you can afford it performance-wise, using all the available dimensions is still very favorable. Now I’m thinking about using the truncated embeddings for prefiltering documents before reading the full embeddings in from disk (as they don’t all fit in main memory) to speed up my vector DB …
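
(For the non-Smalltalkers, a rough Python equivalent of the measurement, assuming full is an (N, 3072) array of unit-length embeddings:)

import numpy as np

def mean_abs_delta(full, sizes, n_samples=50, seed=0):
    """For each truncation size, compare document-to-document dot products against
    the full-dimension scores and return the mean absolute delta."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(len(full), size=n_samples, replace=False)
    results = {}
    for size in sizes:
        compact = full[:, :size] / np.linalg.norm(full[:, :size], axis=1, keepdims=True)
        deltas = [np.abs(full @ full[i] - compact @ compact[i]) for i in sample]
        results[size] = float(np.mean(deltas))
    return results

# e.g. mean_abs_delta(full, [3072, 1536, 1024, 512, 256])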

I’m not a Smalltalk expert, but it’s not clear you are normalizing the truncated embeddings. So something like:

someEmbeddingNormalized := someEmbedding / (someEmbedding norm).
someCompactEmbeddingNormalized := someCompactEmbedding / (someCompactEmbedding norm).

If you don’t normalize, then your dot products will progressively get more and more off as you reduce dimensions (the dot products get smaller and smaller, roughly linearly with the dimension reduction, which might be what you are seeing).

If you do this normalization, in conjunction with truncation, your MTEB scores should reduce gracefully, and not dramatically.
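
(A quick way to see that shrinking effect with synthetic vectors; purely illustrative, not real embeddings:)

import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=3072)
a = base + 0.3 * rng.normal(size=3072)  # two correlated "documents"
b = base + 0.3 * rng.normal(size=3072)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

for dims in (3072, 1024, 256):
    raw = np.dot(a[:dims], b[:dims])  # no re-normalization: shrinks roughly with the dims kept
    an, bn = a[:dims] / np.linalg.norm(a[:dims]), b[:dims] / np.linalg.norm(b[:dims])
    print(dims, round(float(raw), 3), round(float(np.dot(an, bn)), 3))  # re-normalized stays near the full score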

1 Like

Thank you, but my truncated embeddings are all normalized indeed. (The full logic for truncation is (embedding first: newSize) normalize, which is equivalent to your v1 = v1[:1024]; v1/np.linalg.norm(v1)).

However, I’m not computing an exact MTEB score but I just picked a random sample set of documents from my corpus and computed their distance to all other documents. In a perfect world, these distances would be identical for every truncated version of the embeddings, but in reality, they differ of course. For these differences, I have calculated the average (absolute) deltas and shown them in the diagram. I have not studied the entire MTEB paper but I think this benchmark here is pretty relevant in the context of similarity search/RAG tasks, what do you think?

Can you clarify what distance metric you are using?

I ask because your graph goes up to 5+. But dot products, from unit vectors of the same dimension, range from -1 to 1. So this doesn’t add up to me.

Yes, I am using a dot product. Sorry, should have clarified that the y numbers in the graph are percentage values. So, 5 means 5 percent or 0.05. I still need to fix that Smalltalk diagram package to display floats correctly. :wink:

Even dimensions: 2? That would be quite “perfect”. One need only apply such a thought experiment to see that this could hold only if the eliminated elements carried no value.

More fun to experiment with: OpenAI put forth 256 dimensions as feasible. Now, step through all 256-dimension windows of the large embedding model and see the performance.

By “all 256-dimension windows”, do you mean all different subsets of 256 dimensions? I think the entire point of “native support for shortening embeddings” is that the dimensions are already sorted in descending order of relevance.

Yes, of course this “perfect world” breaks down for only a couple of dimensions. I just wanted to motivate my experiment.