It looks like the lower-dimensional 'text-embedding-3' embeddings are truncated and rescaled versions of the higher-dimensional vectors.
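
For context, here is a minimal sketch (NumPy, names are mine) of what that truncate-and-renormalize relationship would look like in practice:

```python
import numpy as np

def truncate_and_renormalize(full_vec, dims):
    """Keep the first `dims` components of the full embedding and re-normalize
    to unit length: the apparent relationship between the lower- and
    higher-dimensional text-embedding-3 vectors."""
    v = np.asarray(full_vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

# e.g. reduce a (placeholder) 3072-dim embedding to the coarse 256-dim layer
full = np.random.rand(3072).astype(np.float32)
coarse = truncate_and_renormalize(full, 256)
```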

I am not using K-means clustering in my approach above. I am using different stages of vector correlations at different resolutions (dimensions), so this is an across-the-board, full search.

The vectors are in memory, and so are the hashes of the text. The actual text lives in a database.

So “Layer 1” would be the coarse 256-dimension layer. If using something like AWS Lambda, you can fit about 2.4 million embeddings per 10 GB of RAM (assuming 256-bit hashes here). Depending on your target latencies, you would stuff more or fewer vectors into each memory chunk. For a latency budget of 1 second, using Python and 1536 dimensions, you can run about 400k vectors per memory chunk. The 256-dim layer would yield more correlations per second, but since the work scales quadratically, I would have to benchmark how many vectors per memory slice hit the desired latency. The same comments apply to the fine-grained 3072-dim layer, or “Layer 2”.
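
A rough sketch of what one of those per-chunk “Layer 1” workers might do (NumPy; the 400k chunk size and the hash placeholders are just illustrative, not part of my setup above):

```python
import numpy as np

# Hypothetical in-memory chunk for "Layer 1": a matrix of 256-dim unit vectors
# plus a parallel list of text hashes (the text itself lives in a database).
chunk_vectors = np.random.rand(400_000, 256).astype(np.float32)
chunk_vectors /= np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
chunk_hashes = [f"hash_{i:07d}" for i in range(len(chunk_vectors))]  # placeholders

def coarse_top_k(query_256, k=100):
    """Full (non-approximate) correlation of the query against every vector
    in this memory chunk, returning the top-k (hash, score) pairs."""
    scores = chunk_vectors @ query_256          # cosine similarity (all unit length)
    top = np.argpartition(scores, -k)[-k:]      # unordered top-k indices
    top = top[np.argsort(scores[top])[::-1]]    # sort descending by score
    return [(chunk_hashes[i], float(scores[i])) for i in top]
```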

Once you get your latencies squared away, you know how many vectors per memory chunk you can process. Then you scale this to X chunks, and the infrastructure will auto-scale out for you when you do repeated Event calls.
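
The fan-out could look something like this sketch (boto3 async “Event” invocations; the function name and payload shape are my assumptions, not something fixed in the approach above):

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def fan_out_query(query_vec_256, num_chunks):
    """Fire one async ('Event') invocation per memory chunk; each worker
    correlates the query against its own chunk and writes its top-K to a DB."""
    for chunk_id in range(num_chunks):
        lambda_client.invoke(
            FunctionName="layer1-correlator",   # hypothetical worker name
            InvocationType="Event",             # async fire-and-forget call
            Payload=json.dumps({
                "chunk_id": chunk_id,
                "query": query_vec_256,
            }),
        )
```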

Then you collect all the events in a DB, check for completeness, and send the resulting top-K to the next stage. You also send the information on which memory slice holds the 3072-dim vectors, along with the hashes. In that stage you don’t do any correlations; you just pull the vectors into memory and return them to another DB record. Once this record comes in, you can do your final correlation on the filtered data at 3072 dimensions.
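
That final stage reduces to something like this (again just a sketch; the variable names and top-k size are mine):

```python
import numpy as np

def final_rerank(query_3072, candidate_hashes, candidate_vectors, k=10):
    """Final-stage correlation: given the filtered candidates' 3072-dim vectors
    (pulled into memory by the second stage), return the true top-k matches
    by cosine similarity."""
    q = query_3072 / np.linalg.norm(query_3072)
    v = candidate_vectors / np.linalg.norm(candidate_vectors, axis=1, keepdims=True)
    scores = v @ q
    order = np.argsort(scores)[::-1][:k]
    return [(candidate_hashes[i], float(scores[i])) for i in order]
```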

This is a basic, naive implementation, but it gives the full argmax correlation back. You could do this for hundreds of millions of records and get close to 2-3 seconds of latency. And it would be pretty cheap.

You could also do this on a big server, but you tend to pay more for hosting, so this is all traffic-dependent.
