Why does `OpenAI Embedding` return different vectors for the same text input?

The article states that Azure always returns the same results, so maybe that’s a better solution?

This article is about OpenAI embeddings being different, and raises it as a bug.

But, I don’t think this will go anywhere. I’m feeling a complete lack of pragmatism here, so let’s hold hands and solve this together. Have you ever looked closer at the vector values and counted how many digits they have?

There’s no reason to be rude here; the example was done with float32 anyway. I understand what you’re saying, but a 0.001 difference for the same text is not acceptable noise or floating-point rounding; it is an issue.
Sure, if you’re doing toy examples rounding is fine, but in any production-grade system this wouldn’t be considered acceptable.

Seeing something?

Yes. The Python library returns extra digits :raised_hands:. So you are safe to either round it yourself, or just deal with the inherent noise that really doesn’t make any difference.

The example was done with float32, so it doesn’t consider extra digits.

I think at this point you’re arguing for the sake of arguing. I may be wrong but it is feeling more and more that way.

Did you not look at my above example? It’s not a 0.001 difference.

Embedding vector values in most text embedding models, including OpenAI’s embedding models, are float32.

At this point I think you’re a troll.

For ada-002 you should be using the dot product, not the magnitude of the difference vector.

What @RonaldGRuckus is talking about is the dot product, which for unit vectors is the same as cosine similarity.

This is the integral of 1536 numbers, which smooths out any errors that you are seeing in the vector magnitudes.

So when comparing with the dot product, you shouldn’t see such a dramatic difference.
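Here’s a minimal sketch of the point, with synthetic stand-ins for real embeddings (numpy only):

import numpy as np

# For unit vectors a and b, cosine similarity is just the dot product,
# and the two comparisons are related exactly by ||a - b||^2 = 2 - 2*(a.b),
# so a "large-looking" difference magnitude maps to a tiny drop in the dot.
rng = np.random.default_rng(0)
a = rng.standard_normal(1536)
a /= np.linalg.norm(a)
b = a + rng.normal(0, 1e-5, size=1536)   # tiny per-component noise
b /= np.linalg.norm(b)

print(np.dot(a, b))              # ~0.99999992
print(np.linalg.norm(a - b))     # ~0.0004, which looks much bigger
print(2 - 2 * np.dot(a, b), np.linalg.norm(a - b) ** 2)  # equal, up to float error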


Eh not trolling and my responses are genuine.


I guess my concern with this example is that it misses the issue, as most of the time the embedding returned is the same. This is just floating-point multiplication, i.e. np.dot(lolz[0], lolz[0]) would give the same distance in this example.

> What @RonaldGRuckus is talking about is the dot product, which for unit vectors is the same as cosine similarity.

Thanks for the correction Curt. When I take cosine similarity with a different example, the difference vector is still off, but the magnitude is lower.

>>> import numpy as np
>>> from openai import OpenAI
>>> client = OpenAI()
>>> embeddings = client.embeddings.create(model="text-embedding-ada-002", input=["hello" for _ in range(3)])
>>> for x in embeddings.data:
...     print(np.array(x.embedding[:5], dtype="float32"))
... 
[-0.02505843 -0.01938856 -0.02778102 -0.0309794  -0.02468836]
[-0.02501037 -0.01939525 -0.0277981  -0.03099541 -0.02470649]
[-0.02505843 -0.01938856 -0.02778102 -0.0309794  -0.02468836]
>>> base_emb = np.array(embeddings.data[0].embedding, dtype="float32")
>>> diff_emb = np.array(embeddings.data[1].embedding, dtype="float32")
>>> same_emb = np.array(embeddings.data[2].embedding, dtype="float32")
>>> np.dot(base_emb, same_emb)
0.9999999
>>> np.dot(base_emb, diff_emb)
0.99999887

If this is considered acceptable so be it, but this is just one particular example and I’m guessing the noise varies.


If you’ve been paying attention, the new OpenAI API now has you specify “float” or “base64” as the encoding type (the `encoding_format` parameter). float is returned rounded near single precision (and with no exponents showing up).

float:

# stream the raw HTTP response bytes to see exactly what the API sent back
embed = client.embeddings.with_raw_response.create(...)
gen = embed.http_response.iter_bytes()
for i in gen:
    print(i, end="")

b'{\n "object": "list",\n "data": [\n {\n "object": "embedding",\n "index": 0,\n "embedding": [\n -0.034035724,\n 0.0011407488,\n -0.0015808099,\n -0.0048585343,\n -0.02370809,\n 0.020018721,\n 0.003345114,\n -0.0042154933,\n 0.004186264,\n -0.00992492,\n 0.020811155,\n 0.022084247,\n

It took a bit of head scratching and bot talk to decode the base64, which encodes four-byte floats.

[-0.03403572365641594, 0.0011407488491386175, -0.0015808099415153265, -0.004858534317463636, -0.02370809018611908, 0.020018720999360085, 0.0033451139461249113, -0.004215493332594633, 0.004186263773590326, -0.00992492027580738, 0.020811155438423157, 0.022084247320890427, -0.0006101585458964109, -0.02298060804605484,

Of course, because the embeddings endpoint is non-deterministic, there’s absolutely no way to verify the extra precision is useful; I can only see that I didn’t do it wrong. A float32 has about seven significant decimal digits.
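For reference, a minimal version of that decode (model and input are just example choices; depending on SDK version the embedding field may already be decoded for you, so treat this as a sketch):

import base64
import numpy as np
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input="hello",
    encoding_format="base64",
)
# Assuming encoding_format="base64" returns the raw base64 string in the
# embedding field: it packs little-endian four-byte floats.
raw = base64.b64decode(resp.data[0].embedding)
vec = np.frombuffer(raw, dtype="<f4")
print(vec[:5], len(vec))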


Admittedly no I haven’t.

The reason I mentioned it though is because in his example he was using the longer embedding examples
[-0.025058425962924957, -0.019388560205698013 ...

Yes, the first 5 or so decimal places in the dot product can be considered signal, and the remaining trailing digits should be considered noise.

The key is using dot product. Not magnitude of the delta vector.

The reasoning for the randomness is how these embeddings get generated. The latest theory is that they take the last hidden layer in the embedding model, right after the last token is generated, and then scale this to a unit vector.

So if “hello” is close to the origin (close to 0 in magnitude), then scaling it back out will introduce error.

But what introduces randomness is that the GPUs are clocked so fast that the numbers arrive at the multiply/add stage at different times on different runs, and because floating-point arithmetic is not associative, you get these random variations.

To “fix” this randomness, they would have to slow down the GPU clocks and synchronize them in a specific order, but this would be a 10x slowdown, and nobody wants that. :rofl:
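A tiny, GPU-free demonstration of the non-associativity point:

import numpy as np

# float32 addition is not associative: summing the same 1536 numbers in a
# different order gives a (slightly) different result. On a GPU the
# accumulation order can vary from run to run, hence the jitter.
rng = np.random.default_rng(0)
x = rng.standard_normal(1536).astype(np.float32)

s_fwd = np.float32(0.0)
for v in x:
    s_fwd += v

s_rev = np.float32(0.0)
for v in x[::-1]:
    s_rev += v

print(s_fwd, s_rev, s_fwd == s_rev)  # the two sums typically differ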


> The reason I mentioned it though is because in his example he was using the longer embedding examples
> [-0.025058425962924957, -0.019388560205698013

@RonaldGRuckus Sorry it wasn’t clear: it was printing the embedding as float64, but cast to float32 when putting it in a numpy array.

Thanks @curt.kennedy, appreciate the summary.


Looks like they may have added that to the API recently. I remember “float” (or something similar) was an undocumented hidden option for a while.

I’d be curious, though, if it even matters in the dot product. I’m thinking the dot product smooths out any variation between the base64 and float versions.

If I remember my stats correctly, isn’t the dot product reducing the sigma by a factor of $\sqrt{1536} \approx 39$? Assuming some IID Gaussian on the errors in each component of the embedding vector.
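A quick toy check of that claim (a synthetic unit vector with equal-magnitude components and 1% IID relative noise, nothing from the real API):

import numpy as np

# Toy model: every component of a unit vector gets an independent 1%
# relative error. The spread of the resulting dot products against the
# clean vector shrinks to about 0.01 / sqrt(1536).
rng = np.random.default_rng(0)
n = 1536
a = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)  # equal-magnitude components

dots = [np.dot(a, a * (1 + rng.normal(0, 0.01, size=n)))
        for _ in range(10_000)]
print(np.std(dots))        # ~2.6e-4
print(0.01 / np.sqrt(n))   # ~2.55e-4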

GPT-3 runs on GPUs…with no +/- 1% per run

The article linked above has some flaws (it assumes all language models are non-deterministic, that temperature=0 is greedy sampling, and makes assumptions about utility), but they did make the API calls. Result: Azure’s ada is deterministic and returns identical values.

Now: was ada embeddings or gpt-3.5 EVER deterministic? Was this some optimization OpenAI decided to hit their same-name models with at some point, among many other alterations that affect output speed and quality? One not worth backporting to GPT-3?

I’m not sure what you mean by no +/- 1% for GPT-3.

What thing here has 1% error?

The error vector magnitude (EVM) above is only 0.1% (=0.001, not 0.01)

The idea is that compared to other items the difference is so great that the scale you are concerned about doesn’t matter.

I’m talking about individual dimensions, and the worst effects.

For example, a harder test is to make gpt-3.5-turbo (instruct, because that’s all we get) produce different top tokens with top-p = 1e-9, which I can do, and which has continued to take less and less generation to hit. Those worst-case logits, where second place becomes first, are over 2% different.

Then apply that not to the noisy score, but to the worst case of the underlying individual values.

1% more in the right semantic dimension when you are characterizing hate speech is randomly banning accounts.

I’ll ponder doing some scripting using these 32 bit floats given by base64.

I suppose 1% error in a specific dimension might seem alarming.

But it doesn’t alarm me much, because inference integrates many of these dimensions (nodes): the model is doing matrix multiplies (a series of dot products, hence some integration/noise reduction). So any individual node’s noise is reduced and shouldn’t overpower the result, especially with many nodes at play.

A 1% angular error could be a result of the scaling from near 0 (mentioned above). Computer arithmetic is a finite lattice, and something near 0 is quantized, locks onto a near-origin lattice point, and then scales out with a 1% error on the unit hyper-sphere.
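A toy illustration of that lattice argument (an assumed mechanism, using an artificial fixed quantization step rather than anything from the actual model):

import numpy as np

# Quantize a raw vector to a fixed lattice step, then scale it out to the
# unit hyper-sphere. The closer the raw vector sits to the origin, the
# larger the angular error after scaling.
rng = np.random.default_rng(0)
step = 1e-3  # hypothetical lattice spacing
for scale in (1.0, 0.1, 0.01):
    raw = rng.standard_normal(1536) * scale
    snapped = np.round(raw / step) * step  # lock onto the lattice
    u_exact = raw / np.linalg.norm(raw)
    u_snap = snapped / np.linalg.norm(snapped)
    print(scale, np.linalg.norm(u_exact - u_snap))  # error grows as scale shrinks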

But, this component-wise noise is integrated out by a factor of 39 when using the dot product for comparison. So the integrated error is down to 0.00025641025641, assuming 1% 1-sigma, but I’m guessing the 1% is probably closer to 2-sigma in your observation (just picking the worst case, right?), so this is 0.000128205128205, hence the 5-ish decimal places of signal in the dot product when comparing embeddings.
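Spelling that arithmetic out: $\sigma_{\text{dot}} \approx \sigma_{\text{component}} / \sqrt{1536}$, so $0.01 / 39 \approx 2.56 \times 10^{-4}$ if the 1% is 1-sigma, or $0.005 / 39 \approx 1.28 \times 10^{-4}$ if it is 2-sigma.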

I don’t know. But my thinking is along the lines of … “Why does it matter?”.

If it’s deterministic at T=0, or not, why should we care?

Does this mean they are using unpredictable quantum fields in their generation or something, and this is the smoking gun? :rofl:

PS Do you think 3.5-Turbo turned off ECC, hence “Turbo”? :face_with_monocle:


:smiley: Naaaaaa, 2%ish gains… not really turbo.


Hi, I’m a little bit late here. Not my intention to generate any discussion, but I would like to know whether you quantified the percentage change in the dot product of two embeddings generated for the same input.
So say a = normalized_embedding(text1) and b = normalized_embedding(text2), where text1 and text2 are the same text; then the dot product (a, b) = mean ± sigma. How large is sigma?
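Concretely, I mean something like this sketch (model name and sample count are just example choices; this calls the live API, so results will vary):

import numpy as np
from openai import OpenAI

# Embed the same text n times, then report the mean and sigma of all
# pairwise dot products between the (re-normalized) embeddings.
client = OpenAI()
n = 20
resp = client.embeddings.create(model="text-embedding-ada-002",
                                input=["hello"] * n)
vecs = np.array([d.embedding for d in resp.data], dtype="float32")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

dots = [float(np.dot(vecs[i], vecs[j]))
        for i in range(n) for j in range(i + 1, n)]
print(f"mean={np.mean(dots):.8f}  sigma={np.std(dots):.2e}")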