I am using text-embedding-3-large on my dataset. For storage reasons, I want to use as few dimensions as possible. As I understand it, the embeddings are trained with Matryoshka Representation Learning. This technique trains the representation at certain discrete sizes (e.g. 256, 512, 1024, …) and can interpolate to any dimension between those values. However, truncating the embeddings below the lowest trained size can result in significant information loss. I searched the documentation and cookbook but was unable to find any reference to the minimum dimension the model was trained on. Can someone help here?
What do you mean? Most OpenAI embeddings come stock at 1536 dimensions unless you are using the larger models (text-embedding-3-large defaults to 3072). But if you are concerned with dimensionality, then you also know you can shorten or expand them.
But you can choose the dimension in your code. Running locally and fast, you'd typically use 384 with a sentence transformer (I don't recommend it, though; 1536 dimensions are easy and manageable).
Matryoshka learning, which is likely used here but, like everything else about the model, is not discussed or documented, nests shells of representation within one another.
OpenAI describes 256 dimensions of text-embedding-3-large as still highly performant. There is no mention of anything smaller. So the discovery you can make for us is to go smaller:
trial1 = embeddings[0:128] # elements 0..127
trial2 = embeddings[128:256] # elements 128..255
Normalize each array, and see whether one performs better than the other in semantic comparisons of varying types.
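The mechanics of that experiment would look something like this (a random stand-in vector replaces a real 3072-dim API response here; the slice positions match the snippet above):

```python
import numpy as np

# Stand-in for a real text-embedding-3-large response (3072 floats).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal(3072)

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine scores."""
    vec = np.asarray(vec, dtype=np.float64)
    return vec / np.linalg.norm(vec)

trial1 = l2_normalize(embeddings[0:128])    # elements 0..127
trial2 = l2_normalize(embeddings[128:256])  # elements 128..255
print(trial1.shape, trial2.shape)
```

With real embeddings you would score each normalized slice on the same set of semantic-similarity pairs and compare the rankings.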
BTW, you might have an underlying simple question: How do I get the vector database disk storage smaller?
I’ve found that casting to FP16 is nearly identical, and that appropriately scaled and clipped int8, or biased FP8 with a mantissa sized for embeddings, is not much worse than the variation from the model’s non-determinism.
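A quick sketch of that compression idea: cast a unit embedding to fp16, and to a symmetrically scaled int8, then check how much cosine similarity against the original survives. The max-abs scale factor here is an illustrative choice, not necessarily what you'd use in production:

```python
import numpy as np

# Stand-in unit-length embedding.
rng = np.random.default_rng(1)
v = rng.standard_normal(1024)
v /= np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Half-precision: just a cast, 2 bytes per component.
v_fp16 = v.astype(np.float16)

# int8: symmetric max-abs scaling with clipping, 1 byte per component.
scale = 127.0 / np.max(np.abs(v))
v_int8 = np.clip(np.round(v * scale), -127, 127).astype(np.int8)
v_deq = v_int8.astype(np.float64) / scale  # dequantize for comparison

print(cosine(v, v_fp16.astype(np.float64)))  # very close to 1.0
print(cosine(v, v_deq))                      # slightly lower, still near 1.0
```

Measuring this on your own vectors before committing to a storage format is cheap and tells you exactly how much ranking quality the cast costs.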
Thanks for the update here! We have tested a variety of embedding sizes, but I’m curious to understand what minimum cutoff was applied at training. We can pass dimensions directly in the embedding API itself, and it accepts all dimensions between 0 and 3072 without any warning/error.
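For reference, OpenAI documents that the `dimensions` parameter on the -3 embedding models behaves like truncating the full vector and re-normalizing, so you can reproduce it client-side. A sketch (the API branch is illustrative and needs the `openai` package plus a real key; the offline branch uses a stand-in vector):

```python
import os
import numpy as np

def shorten(vec, dim):
    """Truncate to the first `dim` components and re-normalize to unit
    length -- per OpenAI's docs, this matches what the `dimensions`
    parameter does server-side for the -3 embedding models."""
    head = np.asarray(vec[:dim], dtype=np.float64)
    return head / np.linalg.norm(head)

if os.environ.get("OPENAI_API_KEY"):
    # Illustrative only: server-side truncation via the API.
    from openai import OpenAI
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input="the quick brown fox",
        dimensions=256,
    )
    short = np.array(resp.data[0].embedding)
else:
    # Offline demonstration with a stand-in full-size vector.
    full = np.random.default_rng(2).standard_normal(3072)
    full /= np.linalg.norm(full)
    short = shorten(full, 256)

print(short.shape)
```

Doing the truncation client-side lets you store one full vector and derive any smaller size later, instead of re-embedding.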
Also @_j, we have experimented with byte (int8) representations as well, but are seeing a regression in performance. We are using a managed OpenSearch cluster, so it does not support float16 on-disk representations for vectors.
Storage reasons… Personally, I think storage is the cheapest of your problems, way cheaper than trying to win the quality race with low-precision vectors…
Some 1.5K dimensions is kind of the “minimum”, as you need precision when searching in big haystacks (I presume storage is the problem because of how many objects you need to dig through?)…