Embeddings for the same content vary. How can this be fixed?

We are currently using the text-embedding-3-small model to embed our documents. However, we have noticed that each time we create an embedding for the same document without changing any of the content, the resulting embedding varies.

This inconsistency is affecting our process of identifying the nearest embedding, as it results in different embeddings being picked each time.

By how much is it affecting the embeddings? What is the dot product of the different embedding vectors for the identical text content?

If you’re checking for an exact match of one embedding vector to another, that might be giving you a false sense that the embedding has meaningfully changed.

Like @anon22939549 said, the only way to compare embeddings is to check the actual vector similarity, not the byte values or raw numbers of the vector itself. Even if the vectors look radically different, they can point to essentially the same location in semantic space.
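To make the point concrete, here is a minimal sketch of the right comparison. The two vectors below are made-up illustrative values, not real model output; OpenAI embeddings are unit-normalized, so for them the cosine similarity reduces to a plain dot product.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; for unit-normalized
    vectors (as OpenAI embeddings are) this equals the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two illustrative "embeddings" of the same text: the raw floats
# differ slightly between calls, but the directions nearly coincide.
v1 = [0.12, -0.34, 0.56, 0.78]
v2 = [0.1201, -0.3398, 0.5601, 0.7799]

print(v1 == v2)                              # False: exact match fails
print(round(cosine_similarity(v1, v2), 4))   # 1.0: same point in semantic space
```

An exact equality check on the floats will almost always report a difference, while the similarity shows the vectors are, for retrieval purposes, the same.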

Hi @wclayf , I encountered the same issue. Would you mind explaining more about why we can’t expect the same text content to get exactly the same numbers in the embedding vector? Thank you so much!

Without intending to take away from @wclayf ’s potential answer, here is a good take on the subject by @curt.kennedy

You can further read up on potential workarounds that will also improve the workflow.

Hope this helps!

That non-deterministic behavior and rank flipping is now expected in the returned vectors.

Thing is, though: there is no “best” in “nearest” when it comes to AI-powered semantic similarity. You’ll probably discover another model among dozens or hundreds whose results a human would judge better (though finding a human who knows an entire embedded corpus well enough to judge is also a bit hard).

A typical approach is to use a specialist “reranker” on a top-k (or top-budget) initial exhaustive search result, perhaps filtering to 10x more candidates than the final chunk count, string length, or input tokenization budget allows. Then you can spend the money on populating upgraded full-dimension embeddings on demand, extending the vectors with additional model calls for an averaging effect, or using varied models with different learning. You could even ask a large-context AI to pick 10 of 50 indexes against a query, if such models weren’t so biased by context input position.
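The two-stage idea above can be sketched in a few lines. This is a toy illustration, not production code: `noisy_embed` is a hypothetical stand-in for an embedding API call (a fixed per-text direction plus small per-call noise, mimicking the model's non-determinism), and the "reranker" here is simply a re-score with more expensive averaged embeddings.

```python
import math
import random

DIM = 16  # toy dimensionality; real models use hundreds or thousands

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def noisy_embed(text, seed):
    """Hypothetical stand-in for an embedding call: a deterministic
    base direction per text, plus small per-call Gaussian noise."""
    base_rng = random.Random(text)  # seeding Random with a str is deterministic
    base = [base_rng.uniform(-1, 1) for _ in range(DIM)]
    noise_rng = random.Random(seed)
    return normalize([x + noise_rng.gauss(0, 0.01) for x in base])

def averaged_embedding(text, calls=5):
    """Average several embedding calls and renormalize: the per-call
    noise partially cancels, giving a more stable vector."""
    vecs = [noisy_embed(text, seed=i) for i in range(calls)]
    mean = [sum(col) / calls for col in zip(*vecs)]
    return normalize(mean)

def search(query, corpus, k=3, oversample=10):
    """Two-stage retrieval: a cheap exhaustive first pass keeping
    far more candidates than needed, then a costlier re-score of
    only those survivors."""
    q = averaged_embedding(query)
    coarse = sorted(corpus, key=lambda doc: dot(q, noisy_embed(doc, seed=0)),
                    reverse=True)[: k * oversample]
    return sorted(coarse, key=lambda doc: dot(q, averaged_embedding(doc)),
                  reverse=True)[:k]
```

The design point is that the expensive step (here, five calls per text) only ever runs on the small oversampled candidate set, not the whole corpus, which is what makes averaging or full-dimension upgrades affordable.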