I am using text-embedding-3-large on my dataset. For storage reasons, I want to use as few dimensions as possible. As I understand it, the embeddings are trained with Matryoshka Representation Learning. This technique trains the representation at certain discrete sizes (e.g. 256, 512, 1024, …) and can interpolate to any dimension between those values. However, truncating the embeddings below the lowest trained size can result in significant information loss. I searched the documentation and cookbook but was unable to find any reference to the minimum dimension the model was trained on. Can someone help here?
What do you mean? Most OpenAI embeddings come stock at 1536 dimensions unless you are using the larger models (text-embedding-3-large defaults to 3072). But if you are concerned with dimensionality, then you also know you can shorten or expand them.
But you can choose the dimension in your code. Running locally and fast, you'd typically use 384 with a sentence transformer (I don't recommend it, though; 1536 dimensions are easy and manageable).
Matryoshka learning, which is likely used here but, like everything else about the model, is not discussed or documented, nests shells of representation within one another.
OpenAI describes 256 dimensions of text-embedding-3-large as still highly performant. There is no mention of anything smaller. So the discovery you can make for us is to go smaller:
trial1 = embeddings[0:128] # elements 0..127
trial2 = embeddings[128:256] # elements 128..255
Normalize each array, and see whether one performs better than the other in semantic comparisons of varying types.
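The mechanics of that experiment would look something like this (a random stand-in vector replaces a real 3072-dim API response here; the slice positions match the snippet above):

```python
import numpy as np

# Stand-in for a real text-embedding-3-large response (3072 floats).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal(3072)

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine scores."""
    vec = np.asarray(vec, dtype=np.float64)
    return vec / np.linalg.norm(vec)

trial1 = l2_normalize(embeddings[0:128])    # elements 0..127
trial2 = l2_normalize(embeddings[128:256])  # elements 128..255
print(trial1.shape, trial2.shape)
```

With real embeddings you would score each normalized slice on the same set of semantic-similarity pairs and compare the rankings.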
BTW, you might have an underlying simple question: How do I get the vector database disk storage smaller?
I’ve found that casting to FP16 is nearly identical, and that appropriately scaled and clipped int8, or biased FP8 with a mantissa sized for embeddings, is not much worse than the variation from the model’s non-determinism.
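A quick sketch of that compression idea: cast a unit embedding to fp16, and to a symmetrically scaled int8, then check how much cosine similarity against the original survives. The max-abs scale factor here is an illustrative choice, not necessarily what you'd use in production:

```python
import numpy as np

# Stand-in unit-length embedding.
rng = np.random.default_rng(1)
v = rng.standard_normal(1024)
v /= np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Half-precision: just a cast, 2 bytes per component.
v_fp16 = v.astype(np.float16)

# int8: symmetric max-abs scaling with clipping, 1 byte per component.
scale = 127.0 / np.max(np.abs(v))
v_int8 = np.clip(np.round(v * scale), -127, 127).astype(np.int8)
v_deq = v_int8.astype(np.float64) / scale  # dequantize for comparison

print(cosine(v, v_fp16.astype(np.float64)))  # very close to 1.0
print(cosine(v, v_deq))                      # slightly lower, still near 1.0
```

Measuring this on your own vectors before committing to a storage format is cheap and tells you exactly how much ranking quality the cast costs.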
Thanks for the update here! We have tested a variety of embedding sizes, but I’m curious to understand what minimum cutoff was applied at training. We can pass dimensions directly in the embedding API itself, and it accepts all dimensions between 0 and 3072 without any warning/error.
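For reference, OpenAI documents that the `dimensions` parameter on the -3 embedding models behaves like truncating the full vector and re-normalizing, so you can reproduce it client-side. A sketch (the API branch is illustrative and needs the `openai` package plus a real key; the offline branch uses a stand-in vector):

```python
import os
import numpy as np

def shorten(vec, dim):
    """Truncate to the first `dim` components and re-normalize to unit
    length -- per OpenAI's docs, this matches what the `dimensions`
    parameter does server-side for the -3 embedding models."""
    head = np.asarray(vec[:dim], dtype=np.float64)
    return head / np.linalg.norm(head)

if os.environ.get("OPENAI_API_KEY"):
    # Illustrative only: server-side truncation via the API.
    from openai import OpenAI
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input="the quick brown fox",
        dimensions=256,
    )
    short = np.array(resp.data[0].embedding)
else:
    # Offline demonstration with a stand-in full-size vector.
    full = np.random.default_rng(2).standard_normal(3072)
    full /= np.linalg.norm(full)
    short = shorten(full, 256)

print(short.shape)
```

Doing the truncation client-side lets you store one full vector and derive any smaller size later, instead of re-embedding.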
Also @_j, we have experimented with byte (int8) representations as well, but are seeing a regression in performance. We are using a managed OpenSearch cluster, so it does not support float16 on-disk representations for vectors.
Storage reasons… Personally, I think storage is the cheapest of your problems, way cheaper than trying to win the quality race with low-precision vectors…
Some 1.5K dimensions is kind of the “minimum”, as you need precision when searching in big haystacks (I presume storage is the problem because of how many objects you need to dig through?)…