I’ve been using embeddings for a while (ada, and now “text-embedding-3-small”) and have noticed that the exact same word produces small variations in the output embedding vectors.
This is an example showing the first 5 values of the embedding vectors from two repetitions of the query “friction” to the model “text-embedding-3-small”:
| “friction” (run 1) | “friction” (run 2) | Diff |
| --- | --- | --- |
| -0.009956564 | -0.009942042 | -1.4522E-05 |
| 0.024430862 | 0.024431106 | -2.44E-07 |
| -0.009832289 | -0.009817766 | -1.4523E-05 |
| -0.02318812 | -0.02318835 | 2.3E-07 |
| 0.009817668 | 0.009832387 | -1.4719E-05 |
As seen in the table, there are slight differences (the diff column is non-zero). I don’t know how much this matters in practice (it depends on the use case), but it would be nice if it were possible to set the temperature and seed to get repeatable output data.
Does anyone know if it is possible to control temperature for embeddings?
The embeddings endpoint doesn’t have softmax, doesn’t generate tokens, doesn’t sample from token probabilities, so no, it does not have a temperature parameter.
What you observe is non-determinism in the AI model: subsequent runs have small variations in the tensor math results. The same is seen across all current OpenAI models - slight variations in outputs between runs.
These variations appear at significant figures well below what affects the quality of similarity search rankings, so it doesn’t matter too much in practice.
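For reference, here is a minimal sketch (assuming the official `openai` Python client and an API key in the environment) of what an embeddings request actually takes - model, input, and optionally `dimensions` and `encoding_format`. There is no temperature or seed to set:

```python
# Minimal sketch, assuming the official `openai` Python client and an
# OPENAI_API_KEY in the environment. The request takes model, input,
# and optionally dimensions / encoding_format -- no temperature, no seed.
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="friction",
    encoding_format="float",
)
vector = resp.data[0].embedding
print(len(vector), vector[:5])  # dimension (1536) and the first few values
```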
Just to add to what @_j wrote: you will actually see variations in chat models (non-embedding ones) even when you set temperature to zero and set a seed value. For example, you will see variations in logprobs on tokens, and therefore in the output tokens themselves. As Jay mentioned, there is a massive amount of infra on OpenAI’s side, and all the slight precision differences in calculations actually add up!
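If you want to see this yourself, here is a minimal sketch (assuming the official `openai` Python client; the model name is only an example) that requests logprobs with temperature zero and a fixed seed, then compares two runs:

```python
# Minimal sketch, assuming the official `openai` Python client; the model
# name is only an example. Even with temperature=0 and a fixed seed, the
# returned logprobs (and occasionally the tokens) can differ between runs.
from openai import OpenAI

client = OpenAI()

def run_once(prompt: str):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=12345,
        max_tokens=5,
        logprobs=True,
        top_logprobs=3,
    )
    return [(t.token, t.logprob) for t in resp.choices[0].logprobs.content]

print(run_once("Say one word about friction."))
print(run_once("Say one word about friction."))  # compare the logprob values
```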
As an example, here are logits from a local model (Anthropic BSides CFT model). I’m sure there are machine precision errors eventually but they are not normally visible when inspecting the logits. At least the first 5 digits of the output logits are always the same in consecutive queries.
Run 1:
tensor([[[-7.5829e-02, 1.3152e-02, -5.3292e-02, -5.5444e-02, 2.0464e-02,
Run 2:
tensor([[[-7.5829e-02, 1.3152e-02, -5.3292e-02, -5.5444e-02, 2.0464e-02,
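For what it’s worth, a minimal sketch of that kind of local comparison (using Hugging Face transformers with GPT-2 as a stand-in, not the exact model from my runs above):

```python
# Minimal sketch of the local comparison, using Hugging Face transformers
# with GPT-2 as a stand-in model (not the exact model from the runs above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("friction", return_tensors="pt")
with torch.no_grad():
    logits_run1 = model(**inputs).logits
    logits_run2 = model(**inputs).logits

print(logits_run1[0, -1, :5])
print(logits_run2[0, -1, :5])
print(torch.equal(logits_run1, logits_run2))  # usually True on a single local device
```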
So for the embeddings it is much more noticeable. But I guess I will have to live with it; I’m just not used to seeing that kind of variation…
The local models will be much more stable in this regard, because they execute as a single instance on the same hardware. The variability comes from a massive infrastructure that consists of many instances of these models spread over thousands of GPUs, where slight deviations in precision add up.
GPT-3 models and their embeddings counterparts had no such behavior: they would always return identical logprobs for identical inputs right up until shutoff, and they ran alongside chat models that did show the symptom.
Therefore, the non-repeatability cannot be attributed simply to datacenter deployment, or to the “fingerprint” field they return with little utility. OpenAI has never answered what the cause is: whether it is an effect of architecture or optimization, or whether they were even attempting to make outputs fuzzy so the technology would be less discoverable (before the ability to reveal embedding size and underlying model parameter count was published).
It’s caused by the asynchronous timing in the GPUs: IEEE floats are not associative, so the order in which parallel partial results are combined changes the output slightly. The curse of massive parallelism. But the benefit, massive speed, outweighs any non-determinism, since it’s in the noise anyway.
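A tiny illustration of that in Python - reordering a floating-point sum changes the result:

```python
# IEEE-754 addition is not associative, so summation order matters.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed before the large terms cancel
```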
You can always do multiple requests and average them to make it more deterministic though…
Or add them all together and get a wider representation of the word. Which may or may not be a good thing (because it could create relations in one part of the space but not in other parts for the same word)…
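A minimal sketch of the averaging idea (assuming the `openai` Python client and numpy; the helper name and defaults are just for illustration):

```python
# Minimal sketch: embed the same text several times and average the vectors
# to damp run-to-run noise. Assumes the `openai` Python client and numpy.
import numpy as np
from openai import OpenAI

client = OpenAI()

def averaged_embedding(text: str, model: str = "text-embedding-3-small", n: int = 5):
    runs = [np.array(client.embeddings.create(model=model, input=text).data[0].embedding)
            for _ in range(n)]
    mean_vec = np.mean(runs, axis=0)
    # Re-normalize: the API returns unit-length vectors, but their mean is not exactly unit length.
    return mean_vec / np.linalg.norm(mean_vec)
```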
You really don’t need to make multiple calls to counter any symptom. Use your multiple calls for other semantic search techniques.
=== Non-Determinism Report for model=text-embedding-3-large ===
Input text: 'Artificial intelligence is changing the world.'
Number of trials: 10
Embedding dimension: 3072
Pairwise similarities across 45 pairs:
Min: 0.999997
Max: 1.000000
Mean: 0.999999
Median: 0.999999
StDev: 0.000001
Average similarity of each run’s vector to the ‘mean vector’: 1.000000
Mean dimension-wise stdev: 0.000015
Max dimension-wise stdev: 0.000071
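For reproducibility, here is a minimal sketch (assuming the `openai` Python client and numpy; the exact script that produced the report above may differ) of how such a report can be generated:

```python
# Minimal sketch: embed the same text N times and summarize the spread
# as pairwise cosine similarities and per-dimension standard deviations.
import itertools
import statistics
import numpy as np
from openai import OpenAI

client = OpenAI()
text = "Artificial intelligence is changing the world."
model = "text-embedding-3-large"
trials = 10

vecs = [np.array(client.embeddings.create(model=model, input=text).data[0].embedding)
        for _ in range(trials)]

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

pair_sims = [cos(a, b) for a, b in itertools.combinations(vecs, 2)]
mean_vec = np.mean(vecs, axis=0)
dim_stdev = np.std(np.stack(vecs), axis=0)

print(f"=== Non-Determinism Report for model={model} ===")
print(f"Number of trials: {trials}, embedding dimension: {len(vecs[0])}")
print(f"Pairwise similarities across {len(pair_sims)} pairs:")
print(f"  Min: {min(pair_sims):.6f}  Max: {max(pair_sims):.6f}")
print(f"  Mean: {statistics.mean(pair_sims):.6f}  Median: {statistics.median(pair_sims):.6f}  StDev: {statistics.stdev(pair_sims):.6f}")
print(f"Average similarity to the mean vector: {statistics.mean([cos(v, mean_vec) for v in vecs]):.6f}")
print(f"Mean dimension-wise stdev: {dim_stdev.mean():.6f}")
print(f"Max dimension-wise stdev: {dim_stdev.max():.6f}")
```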
Had to do some optimizations on a database lately - something with millions of entries every day… and after I was done the CPU load was reduced to 50%. That was when they found out they had accidentally deployed the database to a stage server with a single core.
Had no shell access… otherwise htop would have been my first step, obviously. So it was all just stored procedures and index optimization and stuff…
Back to the topic: not using DB relations in combination with embeddings will always seem strange to me.
I think most folks don’t interlink DBs and embeddings because it adds a layer of complexity.
But any good embedding service should have the strings behind the embeddings in a DB and the embedding vectors in memory (need both).
So the embedding searching is done in memory for speed, and the DB is leveraged to retrieve the text (strings) from the indices corresponding to the top embeddings.
If you are going through all this work, might as well throw in a cheap call to look up the string hash in the DB before embedding it.
It’s fast, inexpensive, and might save you something like 20-30% off the embedding costs, depending on how repetitive your incoming queries are, of course.
The more repetitive, the more savings! (and lower latencies!)
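A minimal sketch of that cache-by-hash idea (SQLite and the `openai` client are just illustrative choices, and the table and helper names are made up):

```python
# Minimal sketch: hash the incoming string, look it up in the DB, and only
# call the embeddings endpoint on a cache miss.
import hashlib
import json
import sqlite3
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("embedding_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY, text TEXT, vector TEXT)")

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = db.execute("SELECT vector FROM cache WHERE hash = ?", (key,)).fetchone()
    if row:
        return json.loads(row[0])  # cache hit: no API call, no cost, lower latency
    vec = client.embeddings.create(model=model, input=text).data[0].embedding
    db.execute("INSERT INTO cache (hash, text, vector) VALUES (?, ?, ?)",
               (key, text, json.dumps(vec)))
    db.commit()
    return vec
```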
Maybe on a purist level it would make sense to separate the two, but interlinking is definitely beneficial when you are switching between and intertwining DB and semantic lookups.
Yeah, my version works too. Fits in a little AWS Lambda function, small footprint, easy to understand, fast, etc. Just depends on how you plan on deploying I guess.
Surprised you don’t have your own Rust version out there @RonaldGRuckus