Can text-embedding-ada-002 be made deterministic?

I’m using text-embedding-ada-002 for creating semantic embeddings from paragraphs of text.

However, each time I call the API with the same paragraph, I get slightly different vectors back. This is surprising, and actually not great, because it introduces unnecessary diffs and non-determinism in downstream processes.

Is there some kind of temperature parameter I can set to make it deterministic? (None is documented in the API endpoint docs.) The text doesn’t change, the semantics of the text don’t change, and the model doesn’t change, so … I should get the same answer!

2 Likes

When you perform similarity tests, are they different?

I know you said this, but I’m wondering what differences you’ve noticed.

Just out of curiosity, what is the dot product of the two embedding vectors? I’m guessing it will be effectively 1.

1 Like

The dot product is very close to 1, yes.
But there are differences in about the fourth decimal digit, sometimes even in the third.
Because I check the generated embeddings into source control (they’re emitted as code that gets compiled into a binary), any difference generates unnecessary diffs.

For now, I’m living with a local text-hash-to-embedding-result cache which will return the same value for the exact same text, but I’m quite surprised that it’s not deterministic.

(Also, this might share root causes with the instances where we got NaN for some embeddings, maybe? Or maybe not, if those were just “service overloaded but can’t return status” errors.)

Also, interestingly, it doesn’t happen for all texts.

Here’s an example of the first 10 values difference:

                Embedding: []float32{
-                       0.00244294, 0.000817778, -0.00104908, -0.0208325, -0.0251986, 0.0265154, -0.00994501, -0.00182094, -0.0122251, -0.018102,
+                       0.00261806, 0.000738551, -0.00112797, -0.021, -0.0250752, 0.0266, -0.0100287, -0.00175173, -0.0121564, -0.0180475,
...

And when re-running and re-embedding, other input texts show the varying outcome. I’d say maybe 3-5% of embeddings generated will vary between runs.

@RonaldGRuckus mentioned a GitHub repo a while back that mentioned the hidden/undocumented “encoding_format” parameter.

Which then points to this repo:

Maybe give that a shot? I know it’s undocumented and could be pulled at any time … haven’t tried it myself.

But setting this to “float” seems to have helped, as mentioned over here:

2 Likes

Look at it this way…

Even if all of the vector values were off by 0.001, the dot product would be 0.999232. The largest difference I see in your example is on the order of 0.0002. At a constant difference of 0.0005, the expected dot product is 0.999808.
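If you want to sanity-check that arithmetic, here’s a quick sketch (assuming ada-002’s 1536 dimensions and unit-length vectors; the constant-offset model is just an estimate):

```python
# For unit vectors u and v:  u . v = 1 - ||u - v||^2 / 2.
# If each of d components differs by a constant delta, then
# ||u - v||^2 ~= d * delta**2.
d = 1536  # text-embedding-ada-002 returns 1536-dimensional unit vectors

for delta in (0.001, 0.0005):
    print(delta, 1 - d * delta**2 / 2)
# 0.001  -> 0.999232
# 0.0005 -> 0.999808
```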

Honestly, I would expect your dot products with this variation to be typically on the order of ~0.99998 or greater, and when looking at vector embeddings I’m pretty sure that’s as close to 1.0 as you’ll ever need to be.

Can I ask what problem you’re anticipating encountering with non-deterministic embeddings?

I assume you’re not constantly re-embedding all of your vectors?

1 Like

It is important that the embeddings are deterministic, because variation changes sort order, which leads to non-determinism in a RAG prompt. I encountered the same issue. It is explained in the repo above; something to do with base64 issues. Just set encoding_format="float" when you call the embeddings API and it will be fixed.
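For reference, the call looks something like this with the Python SDK (a sketch; the parameter was undocumented at the time, and the input text here is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input="the paragraph you want to embed",
    encoding_format="float",  # request plain floats rather than base64
)
vector = resp.data[0].embedding  # list of 1536 floats
```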

3 Likes

Thanks for confirming, @lucasadams, that “encoding_format=float” still works.

As a general solution in the event they ever remove the secret/undocumented “encoding_format” parameter from the API, here is what I would do to solve it:

Use “DRY” or “Don’t Repeat Yourself”. Here, you keep a database containing every chunk of text you have ever embedded, along with its embedding vector, indexed by the hash of the text: same text == same hash. So when a chunk of text comes in, you first hash it and look it up in your database. If it exists, use the stored embedding vector; otherwise create a new hash-and-embedding entry.

Guaranteed determinism!

Plus, if there are lots of repeats, this allows you to retrieve your embedding vector without using the API. This saves API costs, and increases your chances of surviving an API outage. Fringe benefits, I know, but still all upsides. Only downside is database costs and the development time for you to create this.
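For illustration, here’s a minimal sketch of that lookup (the file name, SHA-256 as the hash, and the embed_fn callable are my choices here, nothing prescribed):

```python
import hashlib
import json
import os

CACHE_PATH = "embedding_cache.json"  # hypothetical location

def get_embedding(text: str, embed_fn) -> list[float]:
    """Return the cached embedding for text, calling embed_fn only on a miss."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()  # same text == same hash
    if key not in cache:
        cache[key] = embed_fn(text)  # e.g. one embeddings API call
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f, sort_keys=True)  # stable ordering, minimal diffs
    return cache[key]
```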

4 Likes

Not constantly, but it happens with some frequency as part of an automated (re-)build system. There are many possible build bots, and each build is intended to be hermetic and deterministic, so I have not used any kind of remote database for the vectors – they are treated as source data. It used to be that the build cache would go back out to OpenAI for embeddings when the intermediate artifacts weren’t available or had aged out.

Even a single bit of difference, though, generates unwanted diffs in source control, a new bazel build, a new docker image build, a re-deployment, …

A single bit of difference also means that builds are not deterministic – depending on what the local cache was on a particular build bot, I’d get different ordering between matches based on where the build artifact was generated.

The solution, as I indicated above, is to build a local cache, and check that cache into source control so the build system can reference it from anywhere. (This also has the benefit of incurring fewer charges from OpenAI, but those charges have never been particularly onerous at my current usage.)
The particular git integration we use doesn’t like files larger than 20 MB, so this also requires some local on-disk sharding of the embedding cache, but that’s OK. The cache is ordered so as to generate minimal diffs when new values are added.
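For what it’s worth, a sketch of one way to do that sharding (the two-hex-digit prefix is just one possible scheme; the post doesn’t specify one):

```python
import hashlib

def shard_path(text: str, root: str = "embeddings") -> str:
    """Map a text chunk to one of 256 shard files, keeping each under ~20 MB."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # The first two hex digits of the hash pick the shard; within each shard,
    # entries stay sorted by key so new additions produce minimal diffs.
    return f"{root}/shard_{key[:2]}.json"
```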

That is essentially what I ended up with, but the challenge there is that the database must be available to the distributed build system. But a source control system is a database, so there we have it!

Hey! I used encoding_format=“float” in the ada-002 API call and it didn’t work:
the vectors are still different. I don’t think it’s a huge problem, as we are using the cosine similarity between vectors, but still, I’d like to get to the bottom of the issue.

Unfortunately at this time there is no way to make the embeddings deterministic.

There are several plausible theories as to why they aren’t, but none of them provide a way forward to eliminate the variance observed.

1 Like

Is it possible this is simply caused by the infamous floating point bug in all microprocessors? 3-5% seems a little high for it to be that, but embedding vectors are long decimals.

Is it possible that this is caused by selective availability, so that you can’t determine precisely when the running model is tweaked?

Birthday problem done on 10 string outputs from embeddings (no math)
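(A sketch of the kind of pairwise comparison being run; compare_runs and the exact output format are a reconstruction, not the original script:)

```python
from itertools import combinations

def compare_runs(strings: list[str], model: str) -> None:
    """Compare repeated string renderings of one embedding, line by line."""
    any_mismatch = False
    for i, j in combinations(range(len(strings)), 2):
        lines_i = strings[i].splitlines()
        lines_j = strings[j].splitlines()
        for n, (a, b) in enumerate(zip(lines_i, lines_j), start=1):
            if a != b:
                print(f"{model}: Mismatch found between {i} and {j}")
                print(f"Comparison mismatch found at line number {n}")
                print(f"Line in string {i}: {a}")
                print(f"Line in string {j}: {b}")
                any_mismatch = True
                break  # report only the first mismatching line per pair
    if not any_mismatch:
        print(f"{model}: All outputs match")
```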

text-similarity-ada-001: All outputs match
text-similarity-babbage-001: All outputs match
text-similarity-curie-001: All outputs match
text-similarity-davinci-001: All outputs match
text-search-ada-doc-001: All outputs match
text-search-ada-query-001: All outputs match
text-search-babbage-doc-001: All outputs match
text-search-babbage-query-001: All outputs match
text-search-curie-doc-001: All outputs match
text-search-curie-query-001: All outputs match
text-search-davinci-doc-001: All outputs match
text-search-davinci-query-001: All outputs match
code-search-ada-text-001: All outputs match
code-search-ada-code-001: All outputs match
code-search-babbage-code-001: All outputs match
code-search-babbage-text-001: All outputs match

and then…

text-embedding-ada-002: Mismatch found between 0 and 1
Comparison mismatch found at line number 8
Line in string 0: 0.013175653293728828,
Line in string 1: 0.013211685232818127,

text-embedding-ada-002: Mismatch found between 0 and 2
Comparison mismatch found at line number 8
Line in string 0: 0.013175653293728828,
Line in string 2: 0.013218246400356293,

text-embedding-ada-002: Mismatch found between 0 and 3
Comparison mismatch found at line number 8
Line in string 0: 0.013175653293728828,
Line in string 3: 0.013211685232818127,

Not even past the first value…

(Dozens more pairwise mismatches follow, and the only surprise is that a few pairs match. Every mismatch is at line number 8, so the full dump reduces to the value each of the 10 runs produced there:)

string 0: 0.013175653293728828
string 1: 0.013211685232818127
string 2: 0.013218246400356293
string 3: 0.013211685232818127
string 4: 0.013232099823653698
string 5: 0.013175653293728828
string 6: 0.013218246400356293
string 7: 0.013192304410040379
string 8: 0.013218246400356293
string 9: 0.013232099823653698

Only five distinct values appear across the ten runs; the matching pairs (0 & 5, 1 & 3, 2 & 6 & 8, 4 & 9) are the few that agree.

So sure, it can be made deterministic… round(dimension, 3) if you’re optimistic.

This actually does not make the output deterministic! It just sweeps the problem under the rug.
The reason is that, once in a while, the small delta will be just enough to push a value above or below the cut-off for rounding up or down.
The fewer digits you keep, the lower the probability that any particular value lands exactly on that threshold, but the only way to get a probability of 0 is to round to 0 digits of precision.
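A two-line demonstration of that threshold effect (using the 0.0045 example mentioned below):

```python
a = 0.0045 - 1e-9  # one run's value
b = 0.0045 + 1e-9  # another run, off by a tiny delta
print(round(a, 3), round(b, 3))  # 0.004 vs 0.005: the diff is back
```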

I’m using a client-side cache of text-to-embedding-vector, which has the benefit of making our CI builds less flaky, because OpenAI availability no longer factors into them.
The cache gets re-populated by cache misses on local developer machines, and new values are checked in together with the changes to the source material text (which is also in the same monorepo).
If CI builds detect a cache miss, that’s treated as a test failure, just like if re-running the code formatter on the code would generate a diff.
This is deterministic enough for us but it still leaves Ada embeddings non-deterministic.
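A sketch of that CI-side rule (names here are illustrative; SHA-256 keying as in the cache sketch above):

```python
import hashlib

def ci_lookup(text: str, cache: dict[str, list[float]]) -> list[float]:
    """In CI, a cache miss is a build failure, never a live API call."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        raise AssertionError(
            f"embedding cache miss for {key[:12]}...; "
            "re-embed locally and check the updated cache in"
        )
    return cache[key]
```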

And that is the humor. Round .0045000000000 +/- anything. How bizarre the model is.

1 Like

I was hit by this as well… Computing the cosine proximity of a fixed set of document embedding vectors against the same user question, I had some documents going above my threshold proximity… The only thing varying, for sure, was the embedding of the user question, done on the fly…
I “fixed” this by raising my threshold a little…
But I did not see random responses… I got either one set of cosine proximities (against the whole set of document embeddings) or another one.
As if it depended on which backend server was hit…

(BTW, using Azure OpenAI)