I’ve seen similar topics about the non-deterministic nature of text embeddings, but in all the existing cases I’ve found, the differences between two runs are small when measured by cosine similarity (dot product).
I have encountered a case where the cosine similarity between two consecutive runs fairly consistently comes out at 0.9682875, which is a fair way from 1. This had us troubleshooting for a while, looking for bugs, as we didn’t expect rounding errors or the like to produce differences that large.
Running this code:
import base64

import numpy as np
from openai import OpenAI

openai = OpenAI()
text = "War Stories On The Path To Least Privilege"

# First embedding request; the base64 payload decodes to a float32 vector
r1 = openai.embeddings.create(
    model="text-embedding-3-small",
    input=text,
    encoding_format="base64",
)
r1_embedding = np.frombuffer(base64.b64decode(r1.data[0].embedding), dtype=np.float32)

# Second, identical request
r2 = openai.embeddings.create(
    model="text-embedding-3-small",
    input=text,
    encoding_format="base64",
)
r2_embedding = np.frombuffer(base64.b64decode(r2.data[0].embedding), dtype=np.float32)

# Exact equality of the base64 strings, Euclidean distance, and dot product
# (which is the cosine similarity here, since the vectors are unit length)
print(r2.data[0].embedding == r1.data[0].embedding)
print(np.linalg.norm(r1_embedding - r2_embedding))
print(np.dot(r1_embedding, r2_embedding))
Around 25% of the time, this prints the following for us:
False
0.2518432
0.9682875
Other times the vectors are either identical or have a cosine similarity of 0.9999994.
This also seems to happen with encoding_format="float".
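For reference, the float path looks roughly like this (reusing text and the openai client from the snippet above); the response then carries the embedding as a plain list of floats rather than a base64 string:

# Same request, but with the embedding returned as a list of floats
r3 = openai.embeddings.create(
    model="text-embedding-3-small",
    input=text,
    encoding_format="float",
)
r3_embedding = np.array(r3.data[0].embedding, dtype=np.float32)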
I’ve read in other topics that the non-determinism in text embeddings might stem from the order of floating-point operations on fast-clocked GPUs not being guaranteed, with values “arriving out of order”.
I wonder if there is something that can be done to avoid this, or mitigate the problem in any way? This one case found in our dataset is likely an unusually unlucky scenario, but the dataset isn’t enormous.
I found much closer semantic results with the full dimensionality of text-embedding-3-large.
Where it really matters is in searching against “threshold” results. (You can precheck for identity with a hash for free.)
A symptom, then, would be the possibility of ranks flipping position: the search results moving around between runs.
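The hash precheck can be as simple as keying on a digest of the raw text, so exact duplicates are detected without touching embeddings at all. A minimal sketch (text_key is a made-up helper name):

import hashlib

def text_key(text: str) -> str:
    # Identical input text always yields the identical key, independent of any embedding noise
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = "War Stories On The Path To Least Privilege"
b = "War Stories On The Path To Least Privilege"
print(text_key(a) == text_key(b))  # True: identity established before any API call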
=== Semantic Similarity Stability Report ===
Model used: text-embedding-3-small
Query text: 'How does artificial intelligence affect society?'
Paragraph text: 'Artificial intelligence has profound effects on society, influencing employment, privacy, decision-making, and economic productivity across various industries.'
Number of trials: 100
Interval between calls: 0.100s
Embedding dimension: 1536
Calculation is cosine similarity at float64. Full dimensionality and base64 unpacking into np.float32.
=============================================
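A script along these lines would produce a report like the one above; it is a sketch built from the stated parameters (repeated calls to the embeddings endpoint, base64 payloads unpacked into np.float32, cosine similarity computed at float64), not necessarily the exact code used:

import base64
import time

import numpy as np
from openai import OpenAI

client = OpenAI()
MODEL = "text-embedding-3-small"
QUERY = "How does artificial intelligence affect society?"
PARAGRAPH = (
    "Artificial intelligence has profound effects on society, influencing employment, "
    "privacy, decision-making, and economic productivity across various industries."
)
TRIALS = 100
INTERVAL = 0.1  # seconds between calls

def embed(text: str) -> np.ndarray:
    r = client.embeddings.create(model=MODEL, input=text, encoding_format="base64")
    return np.frombuffer(base64.b64decode(r.data[0].embedding), dtype=np.float32)

similarities = []
for _ in range(TRIALS):
    q = embed(QUERY).astype(np.float64)
    p = embed(PARAGRAPH).astype(np.float64)
    similarities.append(np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p)))
    time.sleep(INTERVAL)

sims = np.array(similarities)
print(f"trials: {TRIALS}  min: {sims.min():.8f}  max: {sims.max():.8f}  std: {sims.std():.2e}")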
Your diagnostic - and your application of embeddings
I asked gpt-4.5 to carefully diagnose the provided snippet for faults or errors that could lead to significant apparent variance (e.g., ~25%) when repeatedly embedding the exact same input with OpenAI’s text-embedding-3-small model, compared to my results, which showed extremely low variance (cosine similarities ≥ 0.999997).
Issue #1:
Incorrect equality comparison of base64 strings:
Embedding fields from OpenAI’s API in "base64" encoding are returned as base64-encoded strings, not directly as numeric arrays. Therefore, using == on them checks for exact string equality. Any minimal difference in the binary encoding (even one insignificant at the numeric level, such as floating-point encoding variance) can lead to different base64 strings, and thus frequently returns False. This only tells us what we already know.
Important:
Two identical floating-point arrays can encode slightly differently at the byte-level representation. It’s a known phenomenon that minor numerical fluctuations or floating-point rounding from the API could lead to differing base64 strings, even if numeric values are nearly identical.
Issue #2:
Incorrect method of measuring similarity:
np.dot(r1_embedding, r2_embedding) alone is not cosine similarity. Without normalization (division by the product of the vectors’ norms), this is simply a dot product—a quantity heavily dependent on vector magnitudes.
Using just np.linalg.norm(r1_embedding - r2_embedding) is also misleading as a similarity measure; it captures raw distance, but not normalized semantic similarity (cosine similarity).
Because embeddings from OpenAI models are normalized (norm ~1), cosine similarity is the meaningful metric. Using unnormalized dot products or raw Euclidean differences can exaggerate perceived variance.
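A tiny numeric illustration of that point, using two hand-made non-unit vectors:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])

# The raw dot product scales with the vectors' magnitudes...
print(np.dot(a, b))  # 50.0
# ...while dividing by the norms recovers the cosine similarity of 1.0
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # 1.0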
Step-by-Step Corrective Analysis:
To properly compare embeddings and accurately measure variance:
1. Decode base64 to numpy arrays (correctly done in the snippet).
2. Normalize embeddings explicitly, or measure cosine similarity using the standard formula.
3. Avoid direct string equality (==) on base64-encoded embeddings.
4. Understand that minor floating-point precision differences will occur.
Corrected and Recommended Testing Code:
Use the corrected approach below to diagnose similarity properly. This code computes cosine similarity between the embeddings:
import numpy as np
import base64
from openai import OpenAI
def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vectors' norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
openai = OpenAI()
text = "War Stories On The Path To Least Privilege"
r1 = openai.embeddings.create(
model="text-embedding-3-small",
input=text,
encoding_format="base64",
)
r1_embedding = np.frombuffer(base64.b64decode(r1.data[0].embedding), dtype=np.float32)
r2 = openai.embeddings.create(
model="text-embedding-3-small",
input=text,
encoding_format="base64",
)
r2_embedding = np.frombuffer(base64.b64decode(r2.data[0].embedding), dtype=np.float32)
cos_sim = cosine_similarity(r1_embedding, r2_embedding)
print(f"Cosine similarity: {cos_sim:.8f}")
print(f"Difference norm: {np.linalg.norm(r1_embedding - r2_embedding):.8f}")
print(f"Embedding 1 norm: {np.linalg.norm(r1_embedding):.8f}")
print(f"Embedding 2 norm: {np.linalg.norm(r2_embedding):.8f}")
This should match your results (near-perfect consistency).
Likely Cause of Reported “25%” Variance:
Incorrect embedding equality check (==) on base64 strings.
Different binary floating-point representations, even minor, produce entirely different base64 strings, hence always False.
Misuse of dot-product as similarity measure, without normalization.
The printed value 0.9682875 is not cosine similarity—it’s just an unnormalized dot product, misleadingly lower due to embedding magnitudes and scale.
Using Euclidean distance (np.linalg.norm(r1_embedding - r2_embedding)) alone without context exaggerates perceived differences (e.g., 0.2518432). Without cosine normalization, this raw difference is not interpretable as semantic similarity.
Thus, no true “fault” in embedding model API output, but faulty interpretation and incorrect metric usage.
Final Advice and Solution:
Always use cosine similarity when comparing embeddings for consistency checks.
Never directly compare embeddings as base64 strings (==).
The provided faulty code snippet’s observed “25%” variance is purely symptomatic of incorrect similarity measurement and embedding comparison.
No underlying API fault or inherent variance in embedding model of that magnitude exists.
By applying the corrected approach provided above, your results will confirm near-perfect consistency and correctly align with your initial, accurate observations (cosine similarity ≈ 0.999997 to 1.000000).
This whole reply looks LLM-generated and arrives at an incorrect conclusion.
I’ll explain how/why for anyone who might find this and risk being misled by it in the future:
The base64 encoding comparison and the Euclidean distance output are only there to demonstrate the immediate difference in results between runs. The cosine similarity output is not affected by them.
Pertaining to the cosine similarity calculation itself, the code relies on the fact that embeddings returned by OpenAI are already normalized, which is actually mentioned in the LLM response itself:
Because embeddings from OpenAI models are normalized (norm ~1), cosine similarity is the meaningful metric.
So while the statement
np.dot(r1_embedding, r2_embedding) alone is not cosine similarity. Without normalization (division by the product of the vectors’ norms), this is simply a dot product—a quantity heavily dependent on vector magnitudes.
is true for the general case of any two vectors, in this particular case the dot product is the cosine similarity.
This can be confirmed by running the original code with the last line modified to normalize the vectors explicitly before the dot product, along these lines:
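# Cosine similarity with explicit normalization; the result is essentially identical to
# the plain dot product above, because both norms are already ~1.0
print(np.dot(
    r1_embedding / np.linalg.norm(r1_embedding),
    r2_embedding / np.linalg.norm(r2_embedding),
))

The printed value matches the plain dot product (up to float32 rounding), confirming that for these unit-length embeddings the dot product already is the cosine similarity.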