I’m using text-embedding-ada-002 for creating semantic embeddings from paragraphs of text.
However, each time I call the API with the same paragraph, I get slightly different vectors back. This is surprising, and actually not great, because it can generate unnecessary differences and non-determinism in downstream processes.
Is there some kind of temperature parameter I can set to make it deterministic? (None is documented in the API endpoint docs.) The text doesn’t change, the semantics of the text don’t change, and the model doesn’t change, so … I should get the same answer!
The dot product is very close to 1, yes.
But there are differences in about the fourth decimal digit, sometimes even in the third.
Because I check in the generated embeddings (they generate code that gets compiled into a binary) any difference generates unnecessary diffs.
For now, I’m living with a local text-hash-to-embedding-result cache which will return the same value for the exact same text, but I’m quite surprised that it’s not deterministic.
(Also, this might share root causes with the instances where we got NaN for some embeddings, maybe? Or maybe not, if those were just “service overloaded but can’t return status” errors.)
Also, interestingly, it doesn’t happen for all texts.
Even if all of the vector values were off by 0.001, the dot product would be 0.999232. The largest difference I see in your example is on the order of 0.0002. At a constant difference of 0.0005, the expected dot product is 0.999808.
Honestly, I would expect your dot products with variation to typically be on the order of ~0.99998 or greater, and when looking at vector embeddings I’m pretty sure that’s as close to 1.0 as you’ll ever need to be.
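That arithmetic is easy to check. For two unit vectors a and b, a·b = 1 − |a − b|²/2, and ada-002 returns 1536-dimensional, approximately unit-norm vectors, so a constant per-component difference eps gives a·b ≈ 1 − 1536·eps²/2. A quick back-of-the-envelope sketch (treating both vectors as exactly unit-norm, which is an approximation):

```python
# Approximate dot product of two unit vectors whose components each
# differ by a constant eps: a·b = 1 - |a - b|^2 / 2, with |a - b|^2 = n * eps^2.

N_DIMS = 1536  # dimensionality of text-embedding-ada-002 vectors

def expected_dot(eps: float, n: int = N_DIMS) -> float:
    """Expected dot product given a constant per-component difference eps."""
    return 1.0 - n * eps ** 2 / 2.0

print(expected_dot(0.001))   # ≈ 0.999232, matching the figure above
print(expected_dot(0.0005))  # ≈ 0.999808
```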
Can I ask what problem you’re anticipating encountering with non-deterministic embeddings?
I assume you’re not constantly re-embedding all of your vectors?
It is important that the embeddings are deterministic, because differences change the sort order of retrieved matches, which leads to non-determinism in a RAG prompt. I encountered the same issue. It is explained in the repo above; it has something to do with base64 decoding. Just set encoding_format="float" when you call the embeddings API and it will be fixed.
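For reference, a minimal sketch of pinning that parameter with the OpenAI Python SDK (the embedding_request_kwargs helper is my own illustration, not part of the SDK):

```python
# Sketch: pin encoding_format="float" on the embeddings call so the SDK
# skips the base64 round-trip. The helper below is a hypothetical wrapper.

def embedding_request_kwargs(text: str) -> dict:
    """Build the keyword arguments for client.embeddings.create()."""
    return {
        "model": "text-embedding-ada-002",
        "input": text,
        "encoding_format": "float",  # request raw floats instead of base64
    }

# Usage (requires an API key in the environment):
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.embeddings.create(**embedding_request_kwargs("some paragraph"))
#   vector = resp.data[0].embedding
```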
Thanks for confirming @lucasadams that the “encoding_format=float” still works.
As a general solution in the event they ever remove the secret/undocumented “encoding_format” parameter from the API, here is what I would do to solve it:
Use “DRY”, or “Don’t Repeat Yourself”. Here, you have a database containing every chunk of text you have ever embedded, along with its embedding vector, indexed by the hash of the text – so same text == same hash. When a chunk of text comes in, you first hash it and look it up in your database. If it exists, use the stored embedding vector; otherwise, create a new hash-and-embedding entry.
Plus, if there are lots of repeats, this allows you to retrieve your embedding vector without using the API. This saves API costs, and increases your chances of surviving an API outage. Fringe benefits, I know, but still all upsides. Only downside is database costs and the development time for you to create this.
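A minimal sketch of that hash-keyed cache, assuming a single JSON file on disk (the class name and file format are my own illustration, not a prescribed design):

```python
import hashlib
import json
import os

class EmbeddingCache:
    """Hash-keyed embedding cache: the same text always yields the same vector."""

    def __init__(self, path: str = "embedding_cache.json"):
        self.path = path
        self.store: dict[str, list[float]] = {}
        if os.path.exists(path):  # reload any previously cached entries
            with open(path) as f:
                self.store = json.load(f)

    @staticmethod
    def key(text: str) -> str:
        """Same text == same hash."""
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_create(self, text: str, embed_fn) -> list[float]:
        """Return the cached vector, calling embed_fn (e.g. the API) only on a miss."""
        k = self.key(text)
        if k not in self.store:
            self.store[k] = embed_fn(text)
            # Naively rewrite the whole file on each miss; fine for a sketch.
            with open(self.path, "w") as f:
                json.dump(self.store, f, sort_keys=True)
        return self.store[k]
```

On a cache hit, no API call is made at all, which is where the cost savings and outage resilience come from.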
Not constantly, but it happens with some frequency as part of an automated (re-)build system. There are many possible build bots, and each build is intended to be hermetic and deterministic, so I have not used a remote database for the vectors – they are treated as source data. Previously, the build cache would go back out to OpenAI for embeddings whenever the intermediate artifacts weren’t available or had aged out.
Even a single bit of difference, though, generates unwanted diffs in source control, a new bazel build, a new docker image build, a re-deployment, …
A single bit of difference also means that builds are not deterministic – depending on the state of the local cache on a particular build bot, I’d get different orderings between matches based on where the build artifact was generated.
The solution, as I indicated above, is to build a local cache, and check that cache into source control so the build system can reference this cache from wherever. (This also has the benefit of incurring less charges from OpenAI, but those charges have never been particularly onerous with my current usage.)
The particular git integration we use doesn’t like files larger than 20 MB, so this also requires some local on-disk sharding of the embedding cache, but that’s OK. The cache is ordered so as to generate minimal diffs when new values are added.
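As an illustration of that sharding, here is one way to split the cache into small files keyed by a hash prefix, with sorted keys so new entries produce minimal diffs (the two-character prefix and JSON layout are assumptions, not the poster’s actual implementation):

```python
import hashlib
import json
import os

def shard_path(root: str, text_hash: str, prefix_len: int = 2) -> str:
    """Map a hash to its shard file, e.g. cache/ab.json (hypothetical layout)."""
    return os.path.join(root, text_hash[:prefix_len] + ".json")

def add_entry(root: str, text: str, vector: list[float]) -> None:
    """Insert one text/embedding pair into the appropriate shard."""
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = shard_path(root, h)
    entries = {}
    if os.path.exists(path):
        with open(path) as f:
            entries = json.load(f)
    entries[h] = vector
    os.makedirs(root, exist_ok=True)
    with open(path, "w") as f:
        # sort_keys keeps file contents stable, so a new entry is a small diff
        json.dump(entries, f, sort_keys=True, indent=0)
```

With a two-hex-character prefix there are up to 256 shards, so each shard stays well under a 20 MB file-size limit for caches of this scale.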
That is essentially what I ended up with, but the challenge there is that the database must be available to the distributed build system. But a source control system is a database, so there we have it!