The article states that Azure always returns the same results, so maybe that’s a better solution?
This article is about OpenAI embeddings returning different values for the same input, and it raises that as a bug.
But I don't think this will go anywhere. I'm sensing a complete lack of pragmatism here, so let's hold hands and solve this together: have you ever looked closely at the vector values and counted how many digits they have?
There's no reason to be rude here; the example was done with float32 anyway. I understand what you're saying, but a 0.001 difference for the same text isn't acceptable noise or floating-point rounding, it's an issue.
Sure, if you're doing toy examples rounding is fine, but in any production-grade system this wouldn't be considered acceptable.
Seeing something?
Yes. The Python library returns extra digits, so you're safe to either round them yourself or just deal with the inherent noise, which really doesn't make any difference.
The example was done with float32, so extra digits aren't the issue.
I guess my concern with this example is that it misses the issue, since most of the time the embedding returned is the same. This is just floating-point multiplication, i.e. np.dot(lolz[0], lolz[0]) would give the same distance in this example.
What @RonaldGRuckus is talking about is the dot product, which for unit vectors is the same as cosine similarity.
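To make that concrete, here is a minimal sketch with made-up numpy vectors (not actual embeddings) showing that once the vectors are unit length, the cosine-similarity denominator is 1, so the dot product is the same number:

import numpy as np

a = np.random.randn(1536)
a /= np.linalg.norm(a)          # scale to a unit vector
b = np.random.randn(1536)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.dot(a, b), cosine)     # identical up to floating-point rounding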
Thanks for the correction, Curt. When I take cosine similarity with a different example, the difference vector is still off, but the magnitude is lower.
If you've been paying attention, the new OpenAI API now has you specify "float" or "base64" as the encoding format. float values come back rounded to roughly single precision (and with no exponents showing up).
float:
embed = client.embeddings.with_raw_response.create(...)   # client = OpenAI(); pass model, input, encoding_format here
gen = embed.http_response.iter_bytes()
for i in gen:
    print(i, end="")   # dump the raw response bytes to inspect exactly what the API sent
Of course, because the embeddings are non-deterministic, there's absolutely no way to verify that the precision is useful. I can only see that I didn't do it wrong. A float32 has about seven significant decimal digits.
The reason I mentioned it, though, is that in his example he was using the longer embedding values [-0.025058425962924957, -0.019388560205698013 ...
Yes, the first 5 or so decimal places in the dot product can be considered signal, and the remaining trailing digits should be considered noise.
The key is using the dot product, not the magnitude of the delta vector.
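A rough sketch of the distinction, using a synthetic embedding with a made-up per-component noise level (not real API output). For unit vectors, 1 - dot(a, b) = ||a - b||^2 / 2, so the dot-product drift is the square of the (already small) delta norm:

import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(1536)
v /= np.linalg.norm(v)                        # pretend this is the "true" embedding

def noisy_copy(vec, sigma=1e-5):              # assumed noise level, illustration only
    out = vec + rng.standard_normal(vec.size) * sigma
    return out / np.linalg.norm(out)          # API embeddings come back unit-length

a, b = noisy_copy(v), noisy_copy(v)
print("dot product:     ", np.dot(a, b))           # sits very close to 1
print("delta magnitude: ", np.linalg.norm(a - b))  # looks much larger by comparison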
The reasoning for the randomness is how these embeddings get generated. The latest theory is they take the last hidden layer in the embedding model right after the last token is generated, and then scale this to a unit vector.
So if “hello” is close to the origin (close to 0 in magnitude), then scaling it back out will introduce error.
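A toy model of that near-origin idea (purely illustrative, not how the model actually works): snap a vector to a fixed grid, then rescale to the unit sphere; the same absolute grid error becomes a much bigger angular error when the vector started out small.

import numpy as np

step = 1e-4                                   # made-up absolute grid spacing ("finite lattice")

def to_unit(vec):
    q = np.round(vec / step) * step           # quantize onto the lattice
    return q / np.linalg.norm(q)              # then scale back out to the unit hyper-sphere

rng = np.random.default_rng(1)
h = rng.standard_normal(1536)
exact = h / np.linalg.norm(h)
angle = lambda u: np.arccos(np.clip(np.dot(u, exact), -1, 1))

print(angle(to_unit(h * 10.0)))               # far from the origin: tiny angular error
print(angle(to_unit(h * 1e-3)))               # near the origin: error orders of magnitude larger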
But what introduces the randomness is that the GPUs are clocked so fast that the numbers arrive at the multiply/add stage at different times on different runs, and because floating-point arithmetic is not associative, you get these random variations.
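The non-associativity part is easy to see on its own (the GPU-ordering story above is the speculative part):

# (a + b) + c and a + (b + c) can differ in IEEE floating point,
# so a reduction whose grouping order varies between runs can vary too.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)   # 0.6000000000000001
print(a + (b + c))   # 0.6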
To “fix” this randomness, they would have to slow down the GPU clocks and synchronize them in a specific order, but this would be a 10x slowdown, and nobody wants that.
Looks like they may have added to the API recently. I remember “float” (or something similar) was an undocumented hidden option for a while.
I'd be curious, though, whether it even matters in the dot product. I'm thinking the dot product smooths out any variation between the base64 and float versions.
If I remember my stats correctly, isn’t the dot product reducing the sigma by a factor of \sqrt{1536}\approx39? Assuming some IID Gaussian on the errors in each component of the embedding vector.
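A quick Monte-Carlo sanity check of that intuition, with a made-up 1% relative error per component and a toy unit vector (nothing here comes from the API):

import numpy as np

rng = np.random.default_rng(0)
dim, rel_noise, trials = 1536, 0.01, 2000        # assumed 1% relative error per component

# toy unit vector with equal-magnitude components, to keep the arithmetic clean
v = rng.choice([-1.0, 1.0], size=dim) / np.sqrt(dim)

# dot against copies of itself carrying 1% relative noise on every component
dots = [np.dot(v, v * (1 + rng.standard_normal(dim) * rel_noise)) for _ in range(trials)]
print(np.std(dots), rel_noise / np.sqrt(dim))    # both ~2.6e-4, i.e. sigma shrunk by ~39x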
The article linked above, while having some flaws (assuming all language models are non-deterministic, assuming temperature=0 is greedy sampling, assumptions about utility), did make the API calls. Result: Azure's ada is deterministic and returns identical values.
Now: was ada embeddings or gpt-3.5 EVER deterministic? Was this some optimization OpenAI decided to hit their same-name models with at some point, among many other alterations that affect output speed and quality? One not worth backporting to GPT-3?
I’m talking about individual dimensions, and the worst effects.
For example, a harder test is to make gpt-3.5-turbo (instruct, because that’s all we get) produce different top tokens with top-p = 1e-9. Which I can do, and which has continued to take less and less generation to hit. Those worst-case logits where the second place becomes the first are over 2% difference.
Then apply that not to the noisy score, but to the worst case of the underlying individual values.
1% more in the right semantic dimension when you are characterizing hate speech means randomly banning accounts.
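For what it's worth, here is roughly the shape of that repeat-run test mentioned above (model, prompt, and counts are just placeholders): with top_p pushed down to 1e-9 the sampler should be effectively greedy, so any divergence between runs reflects logit jitter rather than sampling.

from collections import Counter
from openai import OpenAI

client = OpenAI()
outs = Counter()
for _ in range(20):
    r = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt="Once upon a time,",
        max_tokens=30,
        top_p=1e-9,
    )
    outs[r.choices[0].text] += 1

print(outs)   # more than one distinct completion => the top token itself is moving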
I’ll ponder doing some scripting using these 32 bit floats given by base64.
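If it helps, decoding that is nearly a one-liner. A sketch, assuming the base64 payload is packed little-endian float32 and that the SDK hands the string back untouched when you ask for base64 explicitly (worth double-checking against the float output):

import base64
import numpy as np
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-ada-002",              # example model
    input="hello",
    encoding_format="base64",
)
raw = base64.b64decode(resp.data[0].embedding)   # embedding arrives as a base64 string
vec = np.frombuffer(raw, dtype="<f4")            # assuming little-endian float32
print(vec.shape, vec[:3])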
I suppose 1% error in a specific dimension might seem alarming.
But it doesn't alarm me much, because the inference is integrating over many of these dimensions (nodes): it's doing matrix multiplies (a series of dot products, hence some integration/noise reduction). So any individual node's noise is reduced and shouldn't overpower the result, especially with many nodes at play.
A 1% angular error could be a result of the scaling from near 0 (mentioned above). The computer is a finite lattice, and something near 0 is quantized, locks onto this near-origin lattice point, and then scales out with a 1% error on the unit hyper-sphere.
But this component-wise noise is integrated out by a factor of 39 when using the dot product for comparison. So the integrated error is down to 0.00025641025641, assuming 1% is 1-sigma; but I'm guessing the 1% is probably closer to 2-sigma in your observation (you're just picking the worst case, right?), so this is 0.000128205128205, hence the 5-ish decimal places of signal in the dot product when comparing embeddings.
I don’t know. But my thinking is along the lines of … “Why does it matter?”.
If it’s deterministic at T=0, or not, why should we care?
Does this mean they are using unpredictable quantum fields in their generation or something, and this is the smoking gun?
PS Do you think 3.5-Turbo turned off ECC, hence “Turbo”?
Hi, I'm a little bit late here. It's not my intention to generate any discussion, but I would like to know whether you quantified the percentage change in the dot product of two embeddings generated for the same inputs.
So say a = normalized_embedding(text1) and b = normalized_embedding(text2); then over repeated runs the dot product (a, b) = mean ± sigma. How large is sigma?
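In case it helps frame the question, here is the kind of measurement I mean (model name, texts, and sample count are just placeholders):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    e = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(e.data[0].embedding)        # ada embeddings already come back unit-length

text1, text2 = "the cat sat on the mat", "a feline rested on the rug"
dots = [np.dot(embed(text1), embed(text2)) for _ in range(25)]
print("mean:", np.mean(dots), "sigma:", np.std(dots))
print("relative:", np.std(dots) / np.mean(dots))   # the percentage change being asked about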