It looks like 'text-embedding-3' embeddings at lower dimensions are truncated/rescaled versions of the higher-dimensional output

I mean: compare the quality of 0-255 to 256-511, and so on, on the same model.

See if it isn’t exactly that the semantic search evaluation rolls off in quality as you use later dimension subsets of the output.

One might even find an ideal “discard” threshold, or find that the last 256 dimensions are useless on their own.

Ah, OK, your graph makes sense now. :sweat_smile:

Well, so you are saying that there is a 5% dot-product variation when you go from 3072 dimensions to 256 dimensions.

This isn’t too surprising to me. It suggests that 95% of the important variation is in the first 256 dimensions. So I am guessing they ordered the dimensions, using PCA or something similar, from most important to least important, which is why the vectors can simply be truncated and rescaled down to fewer dimensions.
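As a rough sketch of what that truncate-and-rescale trick looks like in practice (NumPy assumed; full_embedding is a hypothetical 3072-dim vector from text-embedding-3-large):

import numpy as np

def truncate_and_renormalize(full_embedding, n_dims):
    """Keep the first n_dims components and rescale back to unit length."""
    v = np.asarray(full_embedding, dtype=np.float64)[:n_dims]
    return v / np.linalg.norm(v)

# e.g. carve a 256-dim vector out of a stored 3072-dim "mother vector":
# short = truncate_and_renormalize(full_embedding, 256)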

You get a huge dimension reduction … and a huge search speed-up, and a reduced storage footprint … but this speed trade comes with a loss in quality. But 5%? Sounds pretty good to me, although this is not any real-world indication of performance.

What really matters, besides MTEB, is whether the results even make sense, even at 3072 dimensions.

In general, the sentiment I’ve gathered is that the newer models are better, and also have more dynamic range (0 to 1, instead of 0.7 to 1 from ada-002), so new thresholds need to be used.

I am planning on evaluating the new models soon, and adjusting my thresholds.

Any opinions on retrieval or semantic similarity performance? In light of this 5% variation?

The cool thing, and one thing I wanted to emphasize in this thread, is you can carve out your custom dimensions from the large model. So you can go from 3072 to 256, and anything in between, to custom tune your exact speed vs. quality performance trades.

And over time, if you keep your original 3072 vector, you can re-map and tune further.

2 Likes

Are you talking about changing v1[0:1024] to v1[1024:2048] etc.? Well, I still don’t see why this would lead to better results. Here is an example with v1[n:2*n]:

Yes, you are right, the loss from dimensionality reduction has to show up somewhere. :sweat_smile: I just wanted to point out that in my use case (especially similarity search), the loss in recall (and also precision) is pretty painful. It might be different when you have a highly redundant corpus of documents.

Still, being able to truncate your vectors on demand could be very helpful. Even though it makes your pipeline (maintain multiple versions of the vectors in different memory layers, synchronize them, fetch/store them on demand) even more complicated. :slight_smile:

1 Like

I’m only thinking of v[0:N], where 256 \le N \le 3072. So you chop off the back of the vector and renormalize.

I don’t think chopping off the ends and keeping the middle part is good, especially if the theory (at least mine, anyway) is that the most important dimensions, the ones that explain the most variation, are expressed in linear order, from dim 1 most important to dim 3072 least important.

You could run an experiment to test this. For example, what results do you get with v[0:1536] vs. v[1536:3072]?

I would expect v[0:1536] to give the better results.
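Here is a rough sketch of that experiment, assuming you already have a (num_docs, 3072) matrix of document embeddings and a 3072-dim query vector (names are placeholders); the overlap of each half’s top-k with the full-dimension top-k is a crude quality proxy:

import numpy as np

def top_k(query, docs, k=10):
    """Cosine ranking over one dimension slice (vectors renormalized after slicing)."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]

# front = top_k(query[:1536], docs[:, :1536])   # first half of the dimensions
# back  = top_k(query[1536:], docs[:, 1536:])   # second half
# full  = top_k(query, docs)                    # reference ranking
# print(len(set(front) & set(full)), len(set(back) & set(full)))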

Exactly … but also think of the case where you have billions of embeddings. So you first get your top 1000 hits off of the 256 dim set, and then get your fine-tuned results with the 3072 dim versions off the downselected 1000.

So you can form embedding layers … layered by dimension … that act as effective pre-filters.

So you get your 5% error at 256 dims (first 256) coarse correlation, but dig up the top K hits to then perform a 3072 fine-grained correlation against. The fine-grained will give you the performance, and if K is large enough, you can burn down that 5% error if you let enough into the fine-grained 3072 dim search.

All of this off of creating a single 3072 “mother vector” that you keep in a safe spot for later.
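A minimal sketch of that two-stage idea, assuming coarse holds renormalized 256-dim prefixes and fine holds the full 3072-dim mother vectors, both as in-memory NumPy arrays (all names hypothetical):

import numpy as np

def two_stage_search(query, coarse, fine, k_coarse=1000, k_final=10):
    """Stage 1: cheap 256-dim prefilter. Stage 2: exact 3072-dim rerank of the survivors."""
    q_c = query[:256] / np.linalg.norm(query[:256])
    candidates = np.argsort(coarse @ q_c)[::-1][:k_coarse]   # coarse top-K
    q_f = query / np.linalg.norm(query)
    scores = fine[candidates] @ q_f                          # fine-grained rerank
    order = np.argsort(scores)[::-1][:k_final]
    return candidates[order], scores[order]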

1 Like

Yes, you’re right, that’s what my second chart above attempts to show. :slight_smile:

But how do you load all the K fine-grained embeddings from the disk efficiently? Unless they are all stored in a row, I imagine this will be the bottleneck. To store them all in a row, I think you will still need a proper clustering approach as used in IVF. The last time I attempted this I canceled my experiments because I found the parameter tuning for k-means too tedious. :sweat_smile:

I am not using K-means clustering in my approach above. I am using different stages of vector correlations, at different resolutions (dimensions). So this is an across-the-board, full search.

The vectors are in memory, and so is the hash of the text. The actual text exists in a database.

So “Layer 1” would be the coarse 256-dimension layer. If using something like AWS Lambda, you can get 2.4 million embeddings per 10 GB RAM (assuming 256-bit hashes here). Depending on your target latencies, you would stuff more or fewer vectors into each of the memory chunks. So for a latency benchmark of 1 second, using Python and 1536 dimensions, you can run 400k vectors per memory chunk. The 256-dim layer would yield more correlations per second, but since it scales quadratically, I would have to benchmark how many vectors per memory slice to hit the desired latency. Same comments for the fine-grained 3072-dim “Layer 2”.
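To size a chunk against your own latency target, a brute-force pass over one chunk is just a matrix-vector product; a throwaway timing sketch (random data, illustrative sizes, not a benchmark):

import time
import numpy as np

dims, n_vectors = 1536, 100_000                      # scale toward the ~400k per chunk above
chunk = np.random.rand(n_vectors, dims).astype(np.float32)
chunk /= np.linalg.norm(chunk, axis=1, keepdims=True)
query = np.random.rand(dims).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = chunk @ query                               # full search over the chunk
best = int(np.argmax(scores))
print(f"{n_vectors} vectors at {dims} dims in {time.perf_counter() - start:.3f} s")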

Once you get your latencies squared away, you now know how many vectors per mem chunk you can process. So you scale this to X chunks, and then the infrastructure will auto scale out for you when you do repeated Event calls.

Then you collect all the events in a DB, check for completeness, and send the resulting top-K to the next stage. You also send the information on which mem slice holds each 3072-dim vector, and the hashes. In this stage you don’t do any correlations; you just get the vectors into memory and return them to another DB record. Once this record comes in, you can do your final correlation, on the filtered data at 3072 dimensions.

This is a basic naive implementation, but it gives the full argmax correlation back. You could do this for hundreds of millions of records, and get close to 2-3 seconds of latency. And it would be pretty cheap.

You could also do this on a big server, but you tend to pay more for hosting, so this is all traffic dependent.

3 Likes

Maybe I should clarify the scope of my vector DB! I am not using any servers but am trying to achieve a performant vector DB on the user’s end - so my “Layer 1” would be local RAM and should be at most 1 GB (preferably less) and my “Layer 2” would be local disk, where everything up to 10 GB would be fine. A typical DB has 100k-200k documents, so with text-embedding-3-large @ 3072 dimensions, I am at ~2 GB for the embeddings. Search times (full brute-force search) are already fast enough, so my only goal is to reduce memory consumption.

If I understand you correctly, your Layer 2 would still hold the embeddings in RAM, right? So you maintain a lookup table from long embeddings to short embeddings? This is something that won’t work for me because random disk accesses are slow. It would only work if similar embeddings were stored in the same cluster (or a small number of clusters) …

1 Like

What I had envisioned when I responded was that there is flexibility in holding the 3072 vectors in RAM or not, depending on your latency requirements.

In your case, since RAM is at a premium, you would probably want to shard or fragment the 3072-dim vectors into multiple smaller local files on disk. An easy indexing scheme, assuming you are using a hex-based hash, would be to use the first N characters of the hash as the file designator. Note: Hash here means the hash of a chunk of text in your knowledge data set. These are stored alongside your vectors, and you probably have the original text stored alongside this as well.

So hex is base 16, and if you use the first 2 hex characters, you would have 256 local files. So “shard_3072_00.pkl”, … “shard_3072_ff.pkl” assuming a Python pickle implementation. If you add another hex character, you get 4096 shards. So you keep increasing the characters to increase the surface area, and reduce the memory footprint.

Then, when you get the hash IDs, you pull in all the shards from disk that match your first N hex characters. For N large enough, this should reduce the memory footprint of the 3072-dimension vectors quite nicely.

Then you perform your fine-tuned search over the collection of these smaller shards read in from disk. But don’t correlate over the whole set in the shard, just on the single vector with the specific hex id. So you are grabbing bundles of vectors, based on first characters of the hex hash. The bundles are small, take up low memory for large N, and get you your fine-tuned vectors you found in “Layer 1”.
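A sketch of that lookup, assuming pickle shards named as above with a dict of {hash: vector} inside each (the in-shard layout is an assumption):

import pickle
from collections import defaultdict

def load_fine_vectors(hash_ids, shard_dir=".", prefix_len=2):
    """Group the wanted hashes by their leading hex characters, read each matching
    shard once, and pull out only the requested 3072-dim vectors."""
    by_shard = defaultdict(list)
    for h in hash_ids:
        by_shard[h[:prefix_len]].append(h)

    found = {}
    for shard_key, hashes in by_shard.items():
        with open(f"{shard_dir}/shard_3072_{shard_key}.pkl", "rb") as f:
            shard = pickle.load(f)              # assumed: dict of {hash: vector}
        for h in hashes:
            found[h] = shard[h]
    return found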

Beyond a disk/file implementation, if you ran a local database, you could also just retrieve the vectors quickly from local DB calls, and run the correlation directly from getting the vectors from the DB, using the hashes directly, and not worrying about shards from disk. But not sure this has a small memory footprint, unless they use inodes locally to keep it all small. Maybe better than this is to just use the file based version I mentioned above, since it has no memory overhead :rofl:

So it’s pretty flexible depending on your situation.

2 Likes

I thought I’d find out the dot-product quality if I stored the 32-bit embeddings as 16-bit floats (np.float16). The math is still done in doubles via astype.

Testing on a chunked OpenAI blog post. 3-large at 1536 dimensions.

== Cosine similarity comparisons of 16 vs 32 bit DB ==
  • 0:“What is an OpenAI GPT useful for?” <==> 0:" What is an OpenAI GPT useful for?" -
    float32: 1.000000
    float16: 1.000000
  • 0:“What is an OpenAI GPT useful for?” <==> 1:" [1] Documentation does not superced" -
    float32: 0.427506
    float16: 0.427514
  • 0:“What is an OpenAI GPT useful for?” <==> 2:" [2] Both of our new embeddings mode" -
    float32: 0.293851
    float16: 0.293856
  • 0:“What is an OpenAI GPT useful for?” <==> 3:" [3] We’re rolling out custom versio" -
    float32: 0.612172
    float16: 0.612169
  • 0:“What is an OpenAI GPT useful for?” <==> 4:" [4] GPTs let you customize ChatGPT " -
    float32: 0.480640
    float16: 0.480641
  • 0:“What is an OpenAI GPT useful for?” <==> 5:" [5] The GPT Store is rolling out la" -
    float32: 0.465091
    float16: 0.465101
  • 0:“What is an OpenAI GPT useful for?” <==> 6:" [6] We’ve set up new systems to hel" -
    float32: 0.501608
    float16: 0.501614
  • 0:“What is an OpenAI GPT useful for?” <==> 7:" [7] Developers can connect GPTs to " -
    float32: 0.435705
    float16: 0.435714
  • 0:“What is an OpenAI GPT useful for?” <==> 8:" [8] Since we launched ChatGPT Enter" -
    float32: 0.520939
    float16: 0.520944
  • 0:“What is an OpenAI GPT useful for?” <==> 9:" [9] We want more people to shape ho" -
    float32: 0.558575
    float16: 0.558581
  • 0:“What is an OpenAI GPT useful for?” <==> 10:" [10] Creating a GPTHow to create a " -
    float32: 0.527425
    float16: 0.527424
  • 0:“What is an OpenAI GPT useful for?” <==> 11:" [11] Here’s how to create a GPT: H" -
    float32: 0.517636
    float16: 0.517639
  • 0:“What is an OpenAI GPT useful for?” <==> 12:" [12] Advanced SettingsIn the GPT Ed" -
    float32: 0.278446
    float16: 0.278451
  • 0:“What is an OpenAI GPT useful for?” <==> 13:" [13] Settings in the Configure tab:" -
    float32: 0.436078
    float16: 0.436079
  • 0:“What is an OpenAI GPT useful for?” <==> 14:" [14] FAQ: Q: How many files can I u" -
    float32: 0.218532
    float16: 0.218541

Still agreement to five decimal places (six shown).
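For reference, a minimal sketch of the comparison being run here, with the database stored as np.float16 and the math done in doubles (db_f32 and query are placeholder arrays):

import numpy as np

def max_cosine_drift(db_f32, query):
    """Largest cosine difference caused by storing the DB as float16 instead of float32."""
    db_f16 = db_f32.astype(np.float16)              # half the storage
    q = query.astype(np.float64)
    q /= np.linalg.norm(q)
    sims_32 = db_f32.astype(np.float64) @ q
    sims_16 = db_f16.astype(np.float64) @ q
    return float(np.max(np.abs(sims_32 - sims_16)))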

I’ll write more playground code using TensorFlow and see how far I can quantize the storage before thresholds degrade or the top-k starts to flip. 8-bit would be the next goal. I expect OpenAI has already done some bit-crushing in their AI (also see ada-002’s non-determinism).

Then the real utility is in finding how much memory is actually saved, and whether you have a vector DB (or have to start from scratch) or a data structure that will actually save runtime memory.

2 Likes

Also, you should consider 16-bit signed integers. In Python/NumPy the multiplies are supposedly 163 times faster than 16-bit floats, at least according to ChatGPT :sunglasses: It sounds insane; I’d have to test it out to be sure.

16-bit integer vs. 16-bit float multiply benchmark code from ChatGPT
import numpy as np
import timeit

# Set up the arrays for 16-bit integer and 16-bit float multiplication
# (int16 products may wrap around, but that does not affect the timing)
int_array = np.random.randint(0, 1000, size=100000).astype(np.int16)
float_array = np.random.rand(100000).astype(np.float16)

# Define functions that multiply the arrays element-wise
def multiply_integers():
    return int_array * int_array

def multiply_floats():
    return float_array * float_array

# Measure performance over 1000 runs of each
int_time = timeit.timeit(multiply_integers, number=1000)
float_time = timeit.timeit(multiply_floats, number=1000)

print(f"int16 multiplies:   {int_time:.4f} s")
print(f"float16 multiplies: {float_time:.4f} s")

But the idea is to take the vector, multiply it by 32767, and recast it as a 16-bit signed integer. Use something like NumPy, which uses optimized libraries, or code it in something lower level like C or Rust.

You would adjust your thresholds, or divide the final answer back out by the scaling (32767^2 when both vectors are quantized) and use your conventional thresholds.
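A sketch of that int16 recast, dividing by the squared scale at the end so conventional cosine thresholds still apply:

import numpy as np

SCALE = 32767  # max int16 magnitude

def to_int16(unit_vector):
    """Scale a unit-norm embedding by 32767 and store it as signed 16-bit integers."""
    return np.round(np.asarray(unit_vector) * SCALE).astype(np.int16)

def cosine_int16(a_q, b_q):
    """Approximate cosine similarity from two quantized vectors."""
    # accumulate in int64 to avoid overflow, then undo the scaling
    return int(np.dot(a_q.astype(np.int64), b_q.astype(np.int64))) / (SCALE * SCALE)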

The only thing about 16 bits is that the smallest value represented is roughly 3.05185\times 10^{-5}, which might be on the edge of the information from the embedding. If this is a concern, recast to 32-bit signed integers after multiplying by 2^{31}-1 = 2,147,483,647.

That should take care of it. :rofl: Apparently the time difference is small computationally, but you are doubling your memory footprint.

Another trick is to do no multiplies at all! How? Use the Manhattan metric, or L1 norm.

d(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{n} |a_i - b_i|

So: sign-change the second vector, add it to the first vector, take the absolute value, and sum. No multiplies whatsoever.

Should be super fast, even faster with integers.
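A sketch of that multiply-free metric; note it ranks by L1 distance (smaller is closer), so it is a different ordering than cosine similarity rather than a drop-in replacement:

import numpy as np

def manhattan_rank(query, docs, k=10):
    """Sign-change, add, absolute value, sum: no multiplies. Smallest distance first."""
    dists = np.sum(np.abs(docs - query), axis=1)
    return np.argsort(dists)[:k]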

1 Like

Or some GPU…

 # note: tf.keras.losses.cosine_similarity is a loss, so it returns the negative of the cosine similarity
 cosine_sim = tf.keras.losses.cosine_similarity(tensor1, tensor2, axis=-1)

However, I was just contemplating memory size of a live vector store. 256 x 4 bytes = 1k while 1024 x 1 byte = 1k, and the latter is likely going to benchmark better for results.

Then you contemplate, 4k vector for 4k data chunks?

You could shoot for a larger quantization for the longer vector too. So 4 bits per vector entry at “Level 1”, 8 bits at “Level 2”, and so on until you hit your highest quantization state (16/32/64 bits).

The reason why I would go shorter, with more bits, is that I have a theory the vector dimensions from the new OpenAI models are ordered and so you want to devote more bits to the lower dimensions to retain your performance.

So a more quantized 3072 dim vector is like the same quantization at 256 dimensions, as the higher dimensions only add polish, and not much substance. At least that’s my current hypothesis.

For fun you could go multi bit quantization. So dims 1:256, 32 bit. Dims 257:1024, 16 bit. Dims 1025:3072, 8 bit. Something like that. Complicates the correlation, since you now have 3 chunks, but you rescale and combine to form your overall answer, while allocating bits only where the information content is.

You could also run these multi bit chunks in parallel for some extra speedup gains.

This multi bit architecture “flattens” the whole thing, making it easier too, and you avoid multiple stages, which add latency.
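A sketch of one such multi-bit split, with the chunk boundaries and dtypes purely illustrative (the int8 chunk carries its own rescale factor):

import numpy as np

# more bits for the (assumed) more important leading dimensions
CHUNKS = [(0, 256, np.float32), (256, 1024, np.float16), (1024, 3072, np.int8)]
INT8_SCALE = 127.0

def quantize_multibit(v):
    parts = []
    for start, stop, dtype in CHUNKS:
        chunk = v[start:stop]
        if dtype == np.int8:
            chunk = np.round(chunk * INT8_SCALE)
        parts.append(chunk.astype(dtype))
    return parts

def chunked_dot(parts_a, parts_b):
    """Rescale and combine the per-chunk dot products into one overall score."""
    total = 0.0
    for (_, _, dtype), a, b in zip(CHUNKS, parts_a, parts_b):
        partial = float(a.astype(np.float64) @ b.astype(np.float64))
        if dtype == np.int8:
            partial /= INT8_SCALE ** 2          # undo the integer scaling for that chunk
        total += partial
    return total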

BTW, you can do this with any embedding model. After you embed a bunch of things important to you, you would then rank each dimension using PCA to form your permutation vector, and therefore you can now decide your bit boundaries.
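One crude way to read that, as a sketch: score each original dimension by its variance over your own corpus of embeddings (a simple stand-in for a full PCA loading analysis) and sort dimensions from most to least informative; corpus_embeddings is a hypothetical (n, d) array:

import numpy as np

def dimension_permutation(corpus_embeddings):
    """Order dimensions from highest to lowest variance over your own corpus."""
    variances = np.var(corpus_embeddings, axis=0)
    return np.argsort(variances)[::-1]          # most informative dims first

# perm = dimension_permutation(corpus_embeddings)
# reordered = corpus_embeddings[:, perm]        # bit budgets can now follow dim order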

I’m struggling to follow this thread from a technical perspective, but when I read Curt’s very first post I wondered, can we go in the opposite direction, i.e., use the wonderful davinci embeddings model that is now deprecated, and then scale down as necessary for the currently available models. I noticed a very clear degradation of results when I had to redo my corpus’ embeddings from davinci to ada-002. My corpus comprises complex legal text within a narrow topical domain. So the richer the embeddings, the better. Ada-002 can barely cope. I had to make several changes to the rest of my code to wrangle my outputs back to their previous quality. That was frustrating. I wish OpenAI hadn’t deprecated davinci embeddings. For high stakes use cases involving very complex language - the exact use cases where LLMs are of greatest value-add - the extra price for extra dimensions is well worth it.

I think this is possible now, since this is what it looks like the models are doing internally.

But it would come at a cost increase to get those big DaVinci vectors back.

I wish the models weren’t deprecated as fast as they are too … a model seems to last, what, maybe a year or so?

This makes the O&M much higher. And I would have to run multiple embedding engines, and continuously upgrade each one and monitor its performance.

But speaking of multiple dimensions, have you thought of literally sticking a bunch of vectors together, from different models, to form one large massive vector? Maybe that could be useful to you.

So if each vector is a unit vector, and say you fuse 10 models together to form some massive 15k dimensional vector, like DaVinci, then you take the dot product, and divide by 10 to get your new “mega vector” dot product.

This vector would contain information from 10 different models.

Would this help you?

PS. Creating this large 15k vector isn’t totally necessary, as you could just correlate against the 10 models separately and sum (or average) the answers, and it will be the same.
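A sketch of that equivalence, with vecs_a and vecs_b as lists of unit vectors, one per model (hypothetical inputs):

import numpy as np

def fused_similarity(vecs_a, vecs_b):
    """Average cosine across models; identical to concatenating into one "mega vector",
    taking a single dot product, and dividing by the model count."""
    return sum(float(a @ b) for a, b in zip(vecs_a, vecs_b)) / len(vecs_a)

# mega-vector form, same number:
# mega_a, mega_b = np.concatenate(vecs_a), np.concatenate(vecs_b)
# fused = float(mega_a @ mega_b) / len(vecs_a)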

The main difference here is that stacking models like this is coherent processing, whereas performing hybrid rankings from different models is non-coherent processing.

Large vectors, like DaVinci’s, can be coherently synthesized from multiple models.

The question is … is coherent better than non-coherent? My gut feel is that coherent is better, because more dimensions are better, as you’ve experienced.

2 Likes

The base64 embeddings API return contains the exact 4-byte representations of the 32-bit floats.

A peculiar but expected thing happens when you request alternate dimensions: the maximum values are smaller with more dimensions.

So it looks like one would want to store a scale factor along with the tensor, for ideal quality, and do a multiplication by that fraction upon unpacking from RAM.
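A sketch of both halves of that: unpacking the base64 return into float32, then keeping a per-vector scale factor next to a quantized int8 copy (int8 chosen just for illustration):

import base64
import numpy as np

def decode_embedding(b64_string):
    """The base64 payload is the raw bytes of the 32-bit floats."""
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.float32)

def pack_int8(vector):
    """Store an int8 copy plus the scale factor needed to undo it on retrieval."""
    scale = float(np.max(np.abs(vector)))            # stored alongside the tensor
    quantized = np.round(vector / scale * 127).astype(np.int8)
    return quantized, scale

def unpack_int8(quantized, scale):
    return quantized.astype(np.float32) * (scale / 127.0)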

8 bit results

Here’s FP8 e4m3b11fnuz - Exponent: 4, Mantissa: 3, bias: 11 (Extended range: no inf, NaN represented by 0b1000’0000). 3-large d:1536 as before, scaled by x16.

== Cosine similarity comparisons ==

- 0:"What is an OpenAI GPT useful for?" <==> 1:" [1] Documentation does not superced" -
 float32: 0.427506
 float08: 0.427878
- 0:"What is an OpenAI GPT useful for?" <==> 2:" [2] Both of our new embeddings mode" -
 float32: 0.293851
 float08: 0.293858
- 0:"What is an OpenAI GPT useful for?" <==> 3:" [3] We’re rolling out custom versio" -
 float32: 0.612172
 float08: 0.612752
more semantic similarity 32 vs 8
- 0:"What is an OpenAI GPT useful for?" <==> 4:" [4] GPTs let you customize ChatGPT " -
 float32: 0.480640
 float08: 0.480849
- 0:"What is an OpenAI GPT useful for?" <==> 5:" [5] The GPT Store is rolling out la" -
 float32: 0.465091
 float08: 0.464570
- 0:"What is an OpenAI GPT useful for?" <==> 6:" [6] We’ve set up new systems to hel" -
 float32: 0.501608
 float08: 0.500995
- 0:"What is an OpenAI GPT useful for?" <==> 7:" [7] Developers can connect GPTs to " -
 float32: 0.435705
 float08: 0.435533
- 0:"What is an OpenAI GPT useful for?" <==> 8:" [8] Since we launched ChatGPT Enter" -
 float32: 0.520939
 float08: 0.520229
- 0:"What is an OpenAI GPT useful for?" <==> 9:" [9] We want more people to shape ho" -
 float32: 0.558575
 float08: 0.558209
- 0:"What is an OpenAI GPT useful for?" <==> 10:" [10] Creating a GPTHow to create a " -
 float32: 0.527425
 float08: 0.526704
- 0:"What is an OpenAI GPT useful for?" <==> 11:" [11] Here’s how to create a GPT:  H" -
 float32: 0.517636
 float08: 0.517361
- 0:"What is an OpenAI GPT useful for?" <==> 12:" [12] Advanced SettingsIn the GPT Ed" -
 float32: 0.278446
 float08: 0.277858
- 0:"What is an OpenAI GPT useful for?" <==> 13:" [13] Settings in the Configure tab:" -
 float32: 0.436078
 float08: 0.434184
- 0:"What is an OpenAI GPT useful for?" <==> 14:" [14] FAQ: Q: How many files can I u" -
 float32: 0.218532
 float08: 0.217219
embeddings values sample - 32 vs 8

0 [‘+0.02560364’, ‘-0.01831481’, ‘-0.01141823’, ‘-0.02364949’]
0 [‘+0.02539062’, ‘-0.01757812’, ‘-0.01171875’, ‘-0.02343750’]
1 [‘+0.01583907’, ‘-0.01921137’, ‘-0.01613504’, ‘-0.01042185’]
1 [‘+0.01562500’, ‘-0.01953125’, ‘-0.01562500’, ‘-0.01074219’]
2 [‘+0.04101363’, ‘-0.03312357’, ‘-0.00581640’, ‘-0.00354271’]
2 [‘+0.03906250’, ‘-0.03125000’, ‘-0.00585938’, ‘-0.00366211’]
3 [‘+0.02819351’, ‘+0.00425646’, ‘-0.02809555’, ‘-0.01454738’]
3 [‘+0.02734375’, ‘+0.00439453’, ‘-0.02734375’, ‘-0.01464844’]
4 [‘+0.02966771’, ‘+0.00471444’, ‘-0.01965263’, ‘-0.01287654’]
4 [‘+0.02929688’, ‘+0.00488281’, ‘-0.01953125’, ‘-0.01269531’]
5 [‘+0.02289144’, ‘+0.01062817’, ‘-0.02655740’, ‘-0.01890262’]
5 [‘+0.02343750’, ‘+0.01074219’, ‘-0.02734375’, ‘-0.01953125’]
6 [‘+0.04714031’, ‘+0.00739293’, ‘+0.00643247’, ‘-0.00565372’]
6 [‘+0.04687500’, ‘+0.00732422’, ‘+0.00634766’, ‘-0.00585938’]
7 [‘+0.04821001’, ‘-0.00359391’, ‘-0.01465362’, ‘-0.01400830’]
7 [‘+0.04687500’, ‘-0.00366211’, ‘-0.01464844’, ‘-0.01367188’]

Yes, this is expected. Since they are unit vectors, the typical component has a magnitude of about 1/\sqrt N, where N is the dimension of the vector. And 1/\sqrt N \rightarrow 0 as N \rightarrow \infty.

So yes, an appropriate scale factor, say K>10*\sqrt N, should be applied before highly quantizing the values, so you don’t lose this information. So for 3072 dimensions, your scale factor should be at least 555 or so.
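Spelled out: 10\sqrt{3072} \approx 10 \times 55.4 \approx 554, hence the “at least 555” figure.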

But if you use more bits, where the smallest quant value is a small fraction of 1/\sqrt N, then this isn’t a big deal. So at 256 dimensions, at 16 bits, your LSB might be near 0.1/\sqrt {256}, which is 0.00625, so you need fewer bits to express this “smallest value”.

However, since you are doing a correlation after the quantization, the correlation actually adds bits, so overly quantizing the larger vectors may not even be too noticeable after the correlation.

TL;DR More bits are needed to properly express the larger dimensional unit vectors. But the quant noise is reduced anyway after correlation.

@ 3072:

em_ndarray_scale = em_ndarray * 256
RuntimeWarning: overflow encountered in multiply

Might need to re-math that. Or I’m just being dumb with my multiply.

One could run some of the values into saturation with more processing upon retrieving, but the top values that can be represented by 7 bits of magnitude are already a bit quirky.

If OpenAI wasn’t doing normalization, it might even be possible to match up the FP8 their hardware could be doing to have minimum quantization error…

7 bits of magnitude, or 8 bits signed is really really quantized for a 3072 dim vector :rofl:

You could run a saturated system, but you need to clip the data before casting to the quantized version to avoid the error.

You should also measure the percentage of values that get clipped or saturated. You probably don’t want more than 5-10% of these values saturated, to avoid severely restricting your dynamic range, which is roughly 6.02\cdot B dB, for B bits of digitization.

So basically, you don’t want to over clip the bottom LSB (setting the values to 0) or over clip the top (setting values that hit your max bit magnitude).
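A sketch of that check, clipping to the format’s finite range before the cast and reporting how much saturates at a candidate scale (x16 as used earlier; tune to taste):

import numpy as np
import ml_dtypes

def quantize_fp8(vector, scale=16.0):
    """Scale, clip to the FP8 range before casting (avoids the overflow warning),
    and report what fraction of values hit the rails."""
    fp8 = ml_dtypes.float8_e4m3b11fnuz
    limit = float(ml_dtypes.finfo(fp8).max)          # largest finite FP8 value
    scaled = vector * scale
    clipped = np.clip(scaled, -limit, limit)
    saturated = float(np.mean(np.abs(scaled) > limit))
    print(f"saturated: {saturated:.2%} (aim for well under 5-10%)")
    return clipped.astype(fp8)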

You also want to have as much dynamic range as you can afford, but you’d have to look at the performance to see where the sweet spot is.

Unfortunately I don’t have a quick engineering formula for what the dynamic range should be for embedding vectors of N dimensions quantized to B bits to get good performance :grimacing: or what the exact scaling factors should be to not over-clip the bottom and top.

But some trial and error would quickly get your scaling and digital gain knob set. Then it’s off to the races to see how the embeddings perform at these settings.

While comparing some aspects to digital audio sampling is possible, such as “clipping”, audio is typically represented with a linear scale, so one “bit” more is always a doubling in magnitude, or an equivalent lowering of the quantization noise: the 6 dB per bit.

However, the machine learning output is passed to us in floats, from what is likely floating-point inference.

In the E4M3 format, recommended for inference and transformer forward training by NVIDIA, there is a “precision” of three bits, and an exponent “scalar”, so it can represent anything from 2^{-7} just above zero up to ±448 in magnitude (in the bias-7 version).

[image: E4M3 FP8 bit-layout diagram]

I expected that what we get from embeddings is similarly quantized in some manner at its origin, then placed within the limited dynamic range of float32 normalization. (The precision of the FP8 results is suspiciously similar to that of the logit non-determinism of chat models.)

So the possible values are much more exponential. Due to their ML origin, embeddings are likely similar. Applying some log curve to the storage to shift the dynamics of the quantization error would likely just add computation expense. If determined, one could gather a histogram of embedding values to see where the most “value” lies in reducing quantization errors between steps. Then, decoding the 127 values of magnitude could even be done with a LUT instead of math. (audio analogy: μ-law algorithm - Wikipedia)

Anyway, native NumPy has a minimum of 8-bit array elements; anything smaller is padded in RAM. I was foiled by 300 MB of TensorFlow memory overhead, and by my 3.6 GHz quad-core 48 GB Xeon workstation not supporting AVX in current pip wheels - TensorFlow needs the latest version for FP8. ONNX seems a lighter ML library, but the compute isn’t what we’re after.

For NumPy, from ml_dtypes import float8_e4m3b11fnuz is all the storage satisfaction needed for this solution at 1/4th the memory consumption, because custom-packing even smaller bit-length tensors into memory just gets silly.
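A minimal sketch of that storage pattern: keep the database in FP8 for the 4x saving and upcast only for the math (x16 scaling as above; db_f32 and query are placeholders, and in practice you would upcast block-wise to cap peak RAM):

import numpy as np
from ml_dtypes import float8_e4m3b11fnuz

SCALE = 16.0                                             # same x16 scaling as earlier

def compress(db_f32):
    """Store the whole database as FP8: one quarter the RAM of float32."""
    return (db_f32 * SCALE).astype(float8_e4m3b11fnuz)

def search(db_f8, query, k=10):
    """Upcast to float32 just for the dot products."""
    q = query.astype(np.float32)
    q /= np.linalg.norm(q)
    sims = (db_f8.astype(np.float32) / SCALE) @ q
    return np.argsort(sims)[::-1][:k]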

8-bit-stored comparisons, as shown, are more accurate than the AI’s semantic preference between chunks.