It looks like 'text-embedding-3' embeddings are truncated/scaled versions of a higher-dimensional version

There are certainly some novel compression schemes out there, and different ways to represent numbers. For compression on disk, assuming a Python implementation, you could use compressed pickles with gzip, for example.
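As a rough sketch of the gzip + pickle idea for on-disk storage (the helper names `save_compressed` and `load_compressed` are made up here for illustration, not from any library):

```python
import gzip
import pickle

import numpy as np


def save_compressed(obj, path):
    """Pickle `obj` and write it through a gzip stream."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Read back an object written by save_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    import os
    import tempfile

    # Stand-in data; real embeddings would come from the API.
    emb = np.random.default_rng(0).standard_normal((1000, 256)).astype(np.float32)
    path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl.gz")
    save_compressed(emb, path)
    print(emb.nbytes, "bytes in RAM,", os.path.getsize(path), "bytes on disk")
```

How much gzip actually saves depends on the data, so it is worth measuring on your own embeddings. Quantizing to integers first, as discussed next, will likely help more than the compressor alone.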

I think the overhead of ML packages like TensorFlow can easily get out of hand, and they aren’t necessary for embedding analysis beyond basic NumPy or simple low-level routines in C/Rust. They may be required for truly massive workloads, and GPUs could still come in handy for this problem if you have one lying around.

So with this in mind, there are simple ways to get full dynamic range for each embedding chunk by storing an extra FP number per vector.

For each embedding vector, you find the largest magnitude, then rescale the values by this largest magnitude so they fill the signed integer range (assuming two’s complement), like so for this 16-bit example:

Vectorized Python conversion for 16 bit.
import numpy as np

def convert_array_np(arr):
    arr = np.clip(arr, -1, 1)  # Ensuring values are within [-1, 1]
    scaled_arr = np.round(arr * 32768).astype(int)
    scaled_arr = np.clip(scaled_arr, -32768, 32767)  # Clipping to 16-bit signed int range
    return scaled_arr

# Example usage
arr = np.array([0.5, -0.5, 1, -1, 0.1])
converted_arr = convert_array_np(arr)

Before calling the function above, you pre-process the array arr by dividing it by its max magnitude (the largest absolute value, positive or negative; store only the magnitude), and you record this magnitude for later. Note: this step is not done in the code above; you would do it outside of, and before, this function. You would also have to recast the result to 16-bit signed integers, which is not shown.

You now have max dynamic range on this embedding vector for your chosen bit depth.

You do your integer math for the dot product, then scale the result by both the input max magnitude and the target knowledge vector max magnitude, so essentially 2 extra FP multiplies. If the first one hit -0.734 and the second one hit 0.539, you correlate and multiply the correlation by 0.734*0.539. And of course you divide by 2^{15} once per vector (2^{30} in total for the product) if you want to get back to \pm 1 cosine similarities.
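To make the whole round trip concrete, here is a minimal sketch that combines the max-magnitude rescale, the 16-bit conversion, and the integer dot product. The function names are hypothetical. Also note it scales by 32767 rather than 32768, which is my own choice so that a value of exactly +1.0 fits in int16 without clipping.

```python
import numpy as np

SCALE = 32767  # largest positive int16; an assumption, see note above


def quantize_int16(vec):
    """Return (int16 codes, max magnitude) for one float vector."""
    vec = np.asarray(vec, dtype=np.float64)
    peak = float(np.max(np.abs(vec)))
    if peak == 0.0:
        return np.zeros(vec.shape, dtype=np.int16), 0.0
    # Divide by the peak so the largest element lands on +/-SCALE.
    codes = np.round(vec / peak * SCALE).astype(np.int16)
    return codes, peak


def quantized_dot(codes_a, peak_a, codes_b, peak_b):
    """Integer dot product, rescaled back to the original float range."""
    # Accumulate in int64: each product can reach about 2**30, so a
    # narrower accumulator could overflow over many dimensions.
    acc = np.dot(codes_a.astype(np.int64), codes_b.astype(np.int64))
    # Two extra FP multiplies (the peaks) plus the fixed scale factor.
    return float(acc) * peak_a * peak_b / (SCALE * SCALE)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal(256)
    a /= np.linalg.norm(a)
    b = rng.standard_normal(256)
    b /= np.linalg.norm(b)
    ca, pa = quantize_int16(a)
    cb, pb = quantize_int16(b)
    print("float cosine:    ", float(a @ b))
    print("quantized cosine:", quantized_dot(ca, pa, cb, pb))
```

The result only equals a cosine similarity if the original float vectors were unit length to begin with, which is assumed here.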

As for \mu-law, my understanding is that it has more to do with speech and human hearing. But I have more of a non-audio DSP background, so I’m not sure.

Instead, the embeddings lend themselves more to the non-audio, pure DSP domain, IMO, since they aren’t audio per se. :man_shrugging:


Appreciate your ideas, and sorry for the late reply. You are describing splitting up the DB into 256 clusters, and I would worry that some embeddings would be so close to the edge of one cluster that I would miss important similar results when only searching inside the single cluster. So I would probably also need to consider adjacent clusters. But the general idea is very compelling and I am considering giving that a try soon! :slight_smile:

Let’s revisit the initial kernel of the idea here:

There is no “256 cluster” or “edge” within the embedding dimensions themselves. There may be some effect of hierarchy, leading to more specificity and an indescribable abstraction of semantics as one progresses through the results, rooted in the underlying ML layers. Alternatively, it could simply be that dimensions with the most language discrimination value have been remapped to allow for effective truncation, thereby losing their origin story.

If dimension 413 has a strong activation from a concept like “is it red or is it blue”, for example, then you either do or don’t have that aspect of discrimination available when you select a subset of dimensions to work with.

If I didn’t miss some diversion mid-topic, this is what is described:

Preparation and Addition:

  1. Chunk entities of knowledge into a database (I don’t see the case for a hash over an index).
  2. Obtain default max dimension embeddings from the API.
  3. Store parallel pre-normalized “fast embeddings” of smaller dimension or bit depth. These should be indexed to full, or rank-2 tensor (for example: d=256@8).
  4. Store full quality for high-precision search. This might not be original data; one might consider the size of disk or flash clusters and target your reduction to 4kiB or 8kiB if the database is hardware and file system-based (for example: d=2048@16).
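A possible sketch of step 3, reading “d=256@8” as 256 dimensions at 8 bits (that reading is my assumption). The function name `make_fast_embeddings` is hypothetical, and the truncation only makes sense if the leading dimensions carry the most information, as appears to be the case for the text-embedding-3 models.

```python
import numpy as np


def make_fast_embeddings(full, dims=256):
    """Truncate, re-normalize, and quantize rows of `full` to int8.

    Returns (codes, peaks): int8 codes of shape (n, dims) and one float32
    max magnitude per row, needed to undo the scaling later.
    """
    full = np.asarray(full, dtype=np.float32)
    head = full[:, :dims]                                # keep leading dimensions
    norms = np.linalg.norm(head, axis=1, keepdims=True)
    head = head / np.maximum(norms, 1e-12)               # unit length after truncation
    peaks = np.max(np.abs(head), axis=1, keepdims=True)  # per-vector max magnitude
    codes = np.round(head / np.maximum(peaks, 1e-12) * 127).astype(np.int8)
    return codes, peaks[:, 0].astype(np.float32)


if __name__ == "__main__":
    # Stand-in for full-size embeddings from the API.
    full = np.random.default_rng(1).standard_normal((1000, 3072)).astype(np.float32)
    codes, peaks = make_fast_embeddings(full, dims=256)
    print(full.nbytes, "bytes full ->", codes.nbytes + peaks.nbytes, "bytes fast")
```

The extra float per vector is what buys the full dynamic range at 8 bits; the cost is 4 bytes on top of 256.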

Runtime Environment

  • Load small embeddings into RAM.
  • Load full embeddings into near storage.
  • Keep text on disk.


  • Conduct an exhaustive search on fast for top-k.
  • Load top-k full quality embeddings (example result: 10%).
  • Conduct an exhaustive search on top-k with HQ (example result: 1%).
  • Retrieve full text and metadata.
  • More qualification of source contextual neighbors, document reconstruction, etc.
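The first three search steps above could look roughly like this in NumPy. This is only a sketch under stated assumptions: `two_stage_search` and its parameters are hypothetical names, the fast store is kept as plain floats for readability (the quantized version would swap in an integer dot product), and the shortlist size is something you would tune.

```python
import numpy as np


def two_stage_search(query, fast_db, full_db, shortlist=100, top_k=10):
    """Coarse search on truncated vectors, then re-rank with full vectors.

    query:   (D,) full-dimension query embedding, unit length
    fast_db: (n, d) truncated, unit-normalized embeddings kept in RAM, d <= D
    full_db: (n, D) full embeddings (could be a np.memmap on near storage)
    """
    d = fast_db.shape[1]
    q_fast = query[:d] / np.linalg.norm(query[:d])
    coarse = fast_db @ q_fast                      # exhaustive, but cheap
    shortlist = min(shortlist, coarse.shape[0])
    # Unordered indices of the `shortlist` best coarse scores.
    cand = np.argpartition(-coarse, shortlist - 1)[:shortlist]

    fine = full_db[cand] @ query                   # only the shortlist is read
    order = np.argsort(-fine)[:top_k]
    return cand[order], fine[order]


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    full_db = rng.standard_normal((5000, 512))
    full_db /= np.linalg.norm(full_db, axis=1, keepdims=True)
    fast_db = full_db[:, :64]
    fast_db = fast_db / np.linalg.norm(fast_db, axis=1, keepdims=True)
    ids, scores = two_stage_search(full_db[42], fast_db, full_db)
    print(ids[:3], scores[:3])
```

The returned ids are what you would then use to pull the full text and metadata from disk.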

Such an implementation remains understandable, and is not a stretch for existing vector databases for semantic search.


Thank you @_j , perfect summary!

All I am talking about is a processing pipeline that leverages small/fast embeddings obtained by truncation. Because the small embedding extends to the full embedding, the two stay coherent, which is important. You then form small local bundles (from a hash, an index, whatever you want here) to increase your surface area while reducing the memory footprint of the larger/slower, but more detailed, versions of those same small embeddings.

The key here is that truncation maintains coherency between small and large. You aren’t switching between small and large models that are different. That could be viable, but you lose coherency and end up jumping around when using mixed models as a series of cascading filters, so my eyebrows get raised when going down that path, and I’d like to avoid thinking about it.

If you have the compute, you could coherently combine across models … which I alluded to up here. But this was a tangential topic, not related to low memory situations.

Maybe when the big book on RAG gets created, there will be discussion of these topics and considerations. :rofl:

But the bottom line is that truncation/re-scaling appears viable with the new OAI embedding models (maybe others, depending on whether they properly ordered their dimensions, or you order them yourself with PCA). You then have a clear path to creating a cascade of coherent, memory-efficient filters, which opens the door to handling massive amounts of documents as RAG context.
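For models that don’t ship with usefully ordered dimensions, the “order them yourself with PCA” option might be sketched like this. It is plain NumPy SVD, the function names are hypothetical, and you would fit it on a representative sample of your own embeddings. Queries must go through the same mean and components as the documents.

```python
import numpy as np


def fit_pca_order(embeddings):
    """Return (mean, components), components sorted by explained variance."""
    x = np.asarray(embeddings, dtype=np.float64)
    mean = x.mean(axis=0)
    # Rows of vt are principal directions; NumPy returns the singular
    # values in descending order, so row 0 has the most variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt


def project_and_truncate(embeddings, mean, components, dims):
    """Rotate into the PCA basis, keep the first `dims` axes, re-normalize."""
    x = np.asarray(embeddings, dtype=np.float64)
    z = (x - mean) @ components[:dims].T
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norms, 1e-12)


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    sample = rng.standard_normal((2000, 128))   # stand-in for real embeddings
    mean, comps = fit_pca_order(sample)
    small = project_and_truncate(sample, mean, comps, dims=32)
    print(small.shape)
```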

No “edges” or clustering required. It’s a way of getting approximately optimal argmax correlations without going full brute force. It may even be competitive with nearest-neighbor engines like FAISS. Why? Well, your small/fast coarse search instantly creates your nearest-neighbor (N-N) set.


“My first big book of RAG” I’m howling laughing here… ohh boy.
