Embeddings: The average and extreme values within dimensions of 3-large

I was doing some deeper analysis of what is returned by embeddings models, because I have these possibilities to explore:

  • client-side vector dimension reduction and normalization
  • using 8-bit floating point and integer representations of embedding values
  • scaling vectors to minimize quantization error within the limited dynamic range (fewer than 256 distinct values) of 8-bit formats
  • dynamically adapting that scaling to the larger per-dimension values that appear after reducing vector dimensionality
  • allowing clipping: let values overflow and saturate to the limits representable in the number format

Destination 8 bit formats (from 32 bit):

int8: the NumPy version, 8 bits being the smallest practical computation unit - clamped to the range (-127, 127) for symmetry.
float8: from ml_dtypes import float8_e4m3b11fnuz - an ML-oriented format with a 4-bit exponent, 3-bit mantissa, and exponent bias of 11, trading inf and all but one NaN encoding for extended finite range.
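
Both targets are easy to poke at directly in NumPy before committing to anything - a minimal check, with ml_dtypes as the only extra dependency:

import numpy as np
from ml_dtypes import finfo, float8_e4m3b11fnuz

# int8 gives 256 codes; I only use -127..127 so the scale is symmetric around zero.
print(np.iinfo(np.int8))

# float8_e4m3b11fnuz: 1 sign + 4 exponent + 3 mantissa bits, no inf, a single NaN encoding,
# so nearly every bit pattern is a usable finite value.
print(finfo(float8_e4m3b11fnuz).max)   # largest finite magnitude of the format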

So just a brain dump of some findings.


First: run many 3-large embeddings and get statistics:

This distribution is a bit wider than what you see in trained AI model weights. What is notable is how sparse and few the extreme values are:

Top 10 magnitudes of all embeddings_fp32:

1: 0.09365
2: 0.09023
3: 0.08955
4: 0.08892
5: 0.08727
6: 0.08398
7: 0.08193
8: 0.08049
9: 0.07839
10: 0.07725

  • 99th percentile of absolute values: 0.05101
  • 99.9th percentile of absolute values: 0.06997
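
For reference, the bookkeeping behind those numbers is nothing exotic - roughly this, assuming embeddings_fp32 is the (N, 3072) float32 array stacked from many API responses:

import numpy as np

# embeddings_fp32: many text-embedding-3-large results stacked into one (N, 3072) float32 array
abs_vals = np.abs(embeddings_fp32).ravel()

top10 = np.sort(abs_vals)[-10:][::-1]                # ten largest magnitudes in the whole set
p99, p999 = np.percentile(abs_vals, [99, 99.9])

for rank, v in enumerate(top10, start=1):
    print(f"{rank}: {v:.5f}")
print(f"99th percentile of absolute values: {p99:.5f}")
print(f"99.9th percentile of absolute values: {p999:.5f}")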

If my intermediate format is to be the range (-1, 1), then deriving the multiplier from the maximum absolute value ever received is not ideal.

If I instead scale beyond that maximum, so that 0.07 becomes my “1.0” (instead of 0.094 being 1.0), I’ve only damaged 0.1% of the values by saturation, and those clipped values still carry meaning and magnitude.

With clipping allowed, a multiplier almost double what one might expect gives better results, thanks to the higher precision in the stepping of the remaining values.
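
In code, that just means deriving the multiplier from a high percentile instead of the observed maximum and letting the stragglers saturate - a sketch, with the 99.9th percentile as the anchor discussed above:

import numpy as np

def scale_with_clipping(embeddings_fp32, percentile=99.9):
    """Scale into (-1, 1) using a high percentile as the '1.0' anchor, clipping the rest."""
    anchor = np.percentile(np.abs(embeddings_fp32), percentile)   # ~0.07 here instead of ~0.094
    scaled = embeddings_fp32 / anchor
    clipped_fraction = float(np.mean(np.abs(scaled) > 1.0))       # how much gets saturated
    return np.clip(scaled, -1.0, 1.0), anchor, clipped_fraction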


Lowered Dimensions

text-embedding-3-xxx lets you pass a dimensions parameter - or you can do the reduction yourself client-side, as many times as you want, from one API call.

Then, when reducing dimensions by truncating the tail of each embedding and re-normalizing (which is what the API does when you request fewer dimensions as a parameter, now supported with semantic preservation on the “3-xxx” embeddings models - a short sketch of the client-side version follows below the data), we observe this trend in the maximum magnitude across a large set of embeddings:

embedding_map = {
    3072: 0.093654,  # Previous: 124
    2048: 0.104764,  # Previous: 119
    1536: 0.109215,  # Previous: 112
    1024: 0.124259,  # Previous: 110
    768:  0.138169,  # Previous: 109
    512:  0.159741,  # Previous: 109
    384:  0.184648,  # Previous: 113
    256:  0.214117,  # Previous: 113
    192:  0.252522,  # Previous: 120
    160:  0.272869,  # Previous: 121
    128:  0.315785,  # Previous: 127 (clip)
    96:   0.297409,  # Previous: 109
    64:   0.336869,  # Previous: 106
    32:   0.423982,  # Previous: 104
    16:   0.629729,  # Previous: 119
}

In the first column of numbers, we can see the maximum magnitude increasing as dimensions are decreased. The # Previous comments are from some of my first trials at exponentially scaling values, showing the int8 code that the maximum value was mapped to.

Notable: the relationship is not simply linear, and even with that first-generation exponential scaling there is a peak in the scaled values around 128-192 dimensions (depending on what tune-ups are being done).
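
For completeness, here is the client-side reduction mentioned above - truncate the tail, then re-normalize to unit length. This is a sketch of the same operation the API’s dimensions parameter performs:

import numpy as np

def reduce_dimensions(embeddings_fp32, desired_dimensions):
    """Truncate each embedding and re-normalize, like the API's dimensions parameter."""
    if desired_dimensions < embeddings_fp32.shape[1]:
        reduced = embeddings_fp32[:, :desired_dimensions]
        norms = np.linalg.norm(reduced, axis=1, keepdims=True)
        return reduced / norms
    return embeddings_fp32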


Then: the final result is a function with a second-order correction that simply takes the model name and does all the magic based on the dimension count of the embeddings passed in.

embed_model_multiplier = formula(model, embeddings.shape[1])
…

 Dim     Result     Scalar
3072:      1.017      10.85
2048:      1.033       9.86
1536:      1.018       9.33
1024:      1.026       8.26
 768:      1.019       7.37
 512:      0.985       6.16
 384:      0.998       5.40
 256:      0.960       4.48
 192:      0.992       3.93
 160:      0.986       3.61
 128:      1.031       3.27
  96:      0.853       2.87
  64:      0.806       2.39

The “result” above is relative to a scaling that would have targeted minimal clipping - on the order of 1 in 10,000 values. The table shows I preferred to bias it so that higher dimension counts get more clipping - which is actually a smaller percentage of the total values. Then, for ideal clipping and minimum quantization loss, it turns out you multiply that “perfect scaling” by a further 1.5-2x.

(If you’re curious, ada-002 works out to pretty much 1.5x, since we can’t attempt dimensional reduction there.)

The range (-1, 1) is then mapped to the destination format.
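
Putting those pieces together, the final cast might look like this - a sketch where multiplier is the embed_model_multiplier produced by the formula(...) call above:

import numpy as np
from ml_dtypes import float8_e4m3b11fnuz

def quantize(embeddings_fp32, multiplier):
    """Scale into roughly (-1, 1), clip, and cast to the two 8-bit destination formats."""
    scaled = np.clip(embeddings_fp32 * multiplier, -1.0, 1.0)
    as_int8 = np.round(scaled * 127.0).astype(np.int8)       # symmetric -127..127
    as_fp8 = scaled.astype(float8_e4m3b11fnuz)                # rounds to nearest representable float8
    return as_int8, as_fp8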

This adds to the embedding time, but the local computation is a fraction of the API time, and doesn’t need to be repeated on a corpus. The result is immediate storage at 1/4th the size, plus intermediate “fast” results that can use the reduced-bit-depth values directly for SIMD-optimized cosine similarity (basically renormalization + dot product).
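
Since cosine similarity is just a dot product of renormalized vectors, the quantized codes can be compared without undoing the scaling - the shared multiplier cancels during normalization. A plain-NumPy sketch (a real speedup would come from an int8 SIMD dot-product kernel):

import numpy as np

def cosine_similarity_int8(query_int8, corpus_int8):
    """Cosine similarity straight from int8 codes; the shared scale factor cancels out."""
    q = query_int8.astype(np.float32)
    c = corpus_int8.astype(np.float32)
    q /= np.linalg.norm(q)
    c /= np.linalg.norm(c, axis=1, keepdims=True)
    return c @ q      # one similarity score per corpus row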

== Cosine semantic similarity comparisons ==

  • The final results: similarity values computed from the 32-bit binary (float) values returned by the API, vs. the squished and scaled 8-bit versions.

1:" the styles aren’t written in CSS syntax since this" -
match score: fp32: 0.0716 / fp8: 0.0723 / int8: 0.0730
2:" ## Latency optimization Improve latency across a w" -
match score: fp32: 0.2035 / fp8: 0.2002 / int8: 0.2002
3:" Delete variable string Rust language" -
match score: fp32: 0.0775 / fp8: 0.0781 / int8: 0.0779
4:" f desired_dimensions < embeddings_fp32.shape[1]: " -
match score: fp32: 0.1168 / fp8: 0.1169 / int8: 0.1175
5:" Elvis was a hero to most, but a straight up sucka " -
match score: fp32: 0.0691 / fp8: 0.0691 / int8: 0.0693


A tad more about float8 formats for machine learning…

Floating point has higher precision near zero, with the tradeoff set by how many bits go to mantissa vs. exponent; the steps get larger toward the extremes. That aligns with the embeddings histogram. int8, with its uniform steps, wastes precision on extremes where few embedding values ever land.
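
One quick way to see that step-size behavior is to enumerate every bit pattern of the format and measure the gap between neighboring representable values - a small demo, separate from the pipeline itself:

import numpy as np
from ml_dtypes import float8_e4m3b11fnuz

# All 256 bit patterns, reinterpreted as float8 and widened to float32 for inspection.
codes = np.arange(256, dtype=np.uint8).view(float8_e4m3b11fnuz).astype(np.float32)
vals = np.unique(codes[np.isfinite(codes)])

for target in (0.05, 0.25, 0.9):
    i = np.searchsorted(vals, target)
    print(f"float8 step size near {target}: {vals[i] - vals[i - 1]:.6f}")

print(f"int8 step size over (-1, 1): {1 / 127:.6f}")   # uniform everywhere, by contrast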

My chosen float8 format is in the same family as the FP8 formats available on NVIDIA hardware (though it’s not really practical to write an accelerator just for embeddings computations). The max finite magnitude is 30. (A different bias could scale the number format itself, but I want easy reuse. I just scale and use cosine similarity, which isn’t affected by this.)
