I suppose 1% error in a specific dimension might seem alarming.
But it doesn’t alarm me much, because inference is integrating over many of these dimensions (nodes): it’s all matrix multiplies (a series of dot products, hence some averaging/noise reduction). So any individual node’s noise gets diluted and shouldn’t overpower the result, especially with many nodes at play.
A 1% angular error could be a result of the scaling up from near 0 (mentioned above). The computer is a finite lattice: a value near 0 gets quantized, locks onto a near-origin lattice point, and then scales back out with ~1% error on the unit hyper-sphere.
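Just to make that concrete (toy numbers only, nothing measured from an actual model): here’s a little numpy sketch that snaps a unit embedding onto a coarse lattice and scales it back onto the sphere, so whatever angular error falls out is purely quantization. The dimension and lattice spacing are my own made-up assumptions.

```python
# Toy illustration of the "finite lattice" idea above, with made-up numbers:
# store a unit embedding on a coarse fixed-point grid (the lattice), then
# renormalize, and look at the angular error that quantization alone causes.
import numpy as np

rng = np.random.default_rng(1)
dim = 1536             # assumed embedding dimensionality
step = 1e-3            # assumed lattice spacing; purely illustrative

v = rng.normal(size=dim)
v /= np.linalg.norm(v)                     # true point on the unit hyper-sphere

v_q = np.round(v / step) * step            # snap each component to the lattice
v_q /= np.linalg.norm(v_q)                 # scale back out to the unit sphere

cos_angle = np.clip(v @ v_q, -1.0, 1.0)
angle_deg = np.degrees(np.arccos(cos_angle))
worst_component = np.max(np.abs(v - v_q))
print(f"angular error after quantization: {angle_deg:.4f} degrees")
print(f"largest per-component shift     : {worst_component:.2e}")
```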
But that component-wise noise gets averaged out by a factor of about 39 (roughly the square root of the embedding dimension) when you use the dot product for comparison. So the integrated error comes down to about 0.000256 (0.01/39), assuming the 1% is 1-sigma. I’m guessing your 1% is probably closer to 2-sigma (just picking the worst case, right?), which puts it around 0.000128, hence the ~5 decimal places of signal in the dot product when comparing embeddings.
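Here’s a quick Monte Carlo check of that averaging claim (again with assumed numbers: 1536 dims and 1% per-component 1-sigma noise): perturbing one of two unit embeddings component-wise by 1% moves their dot product by roughly 0.01/sqrt(1536) ≈ 0.00026, which is where my factor of ~39 comes from.

```python
# Rough sanity check of the averaging argument, not specific to any model:
# 1536 dims and 1% per-component noise are just the assumptions used above.
import numpy as np

rng = np.random.default_rng(0)
dim = 1536          # assumed embedding dimensionality, sqrt(1536) ~ 39
sigma = 0.01        # 1% per-component relative noise (1-sigma)
trials = 10_000

# A fixed pair of unit embeddings to compare.
a = rng.normal(size=dim)
a /= np.linalg.norm(a)
b = rng.normal(size=dim)
b /= np.linalg.norm(b)
clean = a @ b

# Perturb one side with 1% relative noise and see how much the dot product moves.
noisy_dots = np.empty(trials)
for i in range(trials):
    a_noisy = a * (1 + sigma * rng.normal(size=dim))
    a_noisy /= np.linalg.norm(a_noisy)   # re-project onto the unit sphere
    noisy_dots[i] = a_noisy @ b

print(f"clean dot product : {clean:.6f}")
print(f"std of noisy dots : {noisy_dots.std():.6f}  "
      f"(~ {sigma}/sqrt({dim}) = {sigma/np.sqrt(dim):.6f})")
```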
I don’t know. But my thinking is along the lines of: “Why does it matter?”
Whether it’s deterministic at T=0 or not, why should we care?
Does this mean they are using unpredictable quantum fields in their generation or something, and this is the smoking gun?
PS Do you think 3.5-Turbo turned off ECC, hence “Turbo”?