Do embeddings have temperature?

Hello

I’ve been using embeddings for a while (ada, and now “text-embedding-3-small”) and have noticed that the exact same word produces small variations in the output embedding vector.

I can’t find anything in the docs (https://platform.openai.com/docs/guides/embeddings/?lang=python) about controlling temperature, so I was expecting the default to be temp=0 and identical outputs from the same query.

Here is an example showing the first 5 values of the embedding vectors from two repetitions of the query “friction” against the model “text-embedding-3-small”:

friction (run 1)    friction (run 2)    Diff
-0.009956564        -0.009942042        -1.4522E-05
0.024430862         0.024431106         -2.44E-07
-0.009832289        -0.009817766        -1.4523E-05
-0.02318812         -0.02318835         2.3E-07
0.009817668         0.009832387         -1.4719E-05

As seen in the table, there are slight differences (the Diff column is non-zero). I don’t know how much this matters in practice (it depends on the use case), but it would be nice if it were possible to set the temperature and a seed to get repeatable output.
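Something like the following reproduces the comparison (a minimal sketch with the openai Python client; only the model and query come from above, the harness details are mine):

# Sketch: embed the same word twice and compare the first few dimensions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

run_1 = embed("friction")
run_2 = embed("friction")

# Print the first 5 dimensions and their differences, as in the table above.
for a, b in zip(run_1[:5], run_2[:5]):
    print(f"{a:.9f}  {b:.9f}  diff={a - b:+.4e}")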

Does anyone know if it is possible to control temperature for embeddings?

The embeddings endpoint doesn’t have a softmax, doesn’t generate tokens, and doesn’t sample from token probabilities, so no, it does not have a temperature parameter.

What you’re observing is non-determinism in the AI model itself: subsequent runs show small variations in the tensor-math results. The same is seen across all current OpenAI models - slight variations in outputs between runs.

These variations appear at significant figures far below what affects the quality of similarity-search rankings, so it doesn’t matter too much.

friction_1     friction_2     Diff          Percentage Variance
-0.00995656    -0.00994204    -1.4522e-05    0.145854
0.0244309      0.0244311      -2.44e-07     -0.000998737
-0.00983229    -0.00981777    -1.4523e-05    0.147707
-0.0231881     -0.0231883      2.3e-07      -0.000991887
0.00981767     0.00983239     -1.4719e-05   -0.149924
4 Likes

Hi @Phinder !

Just to add to what @_j wrote: you will actually see variations in chat models (non-embedding ones) even when you set temperature to zero and set a seed value. For example, you will see variations in the logprobs on tokens, and therefore in the output tokens themselves. As Jay mentioned, there is a massive amount of infra on the OpenAI side, and all the slight precision differences in the calculations add up!
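To see it yourself, something like this works (a minimal sketch with the openai Python client; the model name and prompt here are arbitrary placeholders):

# Sketch: run the same chat request twice with temperature=0 and a fixed seed,
# then compare the token logprobs between runs.
from openai import OpenAI

client = OpenAI()

def token_logprobs(prompt: str) -> list[float]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder: any chat model that returns logprobs
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=1234,
        logprobs=True,
        max_tokens=8,
    )
    return [t.logprob for t in response.choices[0].logprobs.content]

run_1 = token_logprobs("Name one source of friction.")
run_2 = token_logprobs("Name one source of friction.")

# Even with temperature=0 and a fixed seed, these often differ slightly.
for a, b in zip(run_1, run_2):
    print(f"{a:.6f}  {b:.6f}  diff={a - b:+.2e}")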

1 Like

Thanks for the inputs!

As an example, here are logits from a local model (Anthropic BSides CFT model). I’m sure machine-precision errors creep in eventually, but they are not normally visible when inspecting the logits: at least the first 5 digits of the output logits are always the same across consecutive queries.

Run 1:
tensor([[[-7.5829e-02, 1.3152e-02, -5.3292e-02, -5.5444e-02, 2.0464e-02,

Run 2:
tensor([[[-7.5829e-02, 1.3152e-02, -5.3292e-02, -5.5444e-02, 2.0464e-02,

So for the embeddings it is much more noticeable. But I guess I will have to live with it; I’m just not used to seeing that kind of variation…
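For reference, the local check is roughly this (a sketch using Hugging Face transformers; “gpt2” is only a placeholder model for illustration, not the one actually used above):

# Sketch: run the same forward pass twice on a local model and compare logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("friction", return_tensors="pt")

with torch.no_grad():
    logits_1 = model(**inputs).logits
    logits_2 = model(**inputs).logits

print(logits_1[0, -1, :5])                         # first 5 logits, run 1
print(logits_2[0, -1, :5])                         # first 5 logits, run 2
print(torch.max(torch.abs(logits_1 - logits_2)))   # typically exactly 0 on the same hardware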

1 Like

Local models will be much more stable in this regard, because they execute as a single instance on the same hardware. The variability comes from a massive infrastructure consisting of many instances of these models spread over thousands of GPUs, where slight deviations in precision add up.

I wrote a bit more about this variability here.

2 Likes

GPT-3 models and their embeddings counterparts had no such behavior: they would always return identical logprobs for identical inputs right up until shutoff, while running alongside the chat models that did show the symptom.

Therefore, the non-repeatability cannot be attributed simply to datacenter deployment, or to the “fingerprint” field they return with little utility. OpenAI has never answered what the cause is: whether it is an effect of architecture or optimization, or whether they were deliberately making outputs fuzzy to reduce discoverability of the technology (before the ability to reveal embedding size and underlying model parameter count was fully published).

3 Likes

It’s caused by the asynchronous timing in the GPUs. IEEE floating-point arithmetic is neither associative nor distributive, so the order in which a parallel reduction accumulates terms changes the last bits of the result. The curse of massive parallelism. But the benefit, massive speed, outweighs any non-determinism, since it’s in the noise anyway.
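A quick way to see the order-dependence in plain Python (no GPU needed; the numbers are just chosen to make the rounding obvious):

# Floating-point addition is not associative: summation order changes the result.
a, b, c = 0.1, 1e16, -1e16

print((a + b) + c)   # 0.0  (0.1 is lost when added to the huge value first)
print(a + (b + c))   # 0.1

# A parallel reduction that groups terms differently between runs therefore
# produces slightly different sums, which is where the run-to-run noise lives.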

2 Likes

You can always do multiple requests and average them to make it more deterministic though…

Or add them all together for a wider representation of the word. Which may or may not be a good thing (because it could create relations in one part of the space but not in other parts for the same word)…
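A sketch of what the averaging would look like (numpy; embed() stands in for whatever single-embedding call you already make):

# Sketch: average several embeddings of the same text and re-normalize.
import numpy as np

def averaged_embedding(text: str, n: int = 5) -> np.ndarray:
    vectors = np.array([embed(text) for _ in range(n)])   # embed() is your existing call (assumed)
    mean = vectors.mean(axis=0)
    return mean / np.linalg.norm(mean)                    # keep unit length for cosine similarity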

You really don’t need to make multiple calls to counter this symptom. Use your multiple calls for other semantic-search techniques.

=== Non-Determinism Report for model=text-embedding-3-large ===
Input text: 'Artificial intelligence is changing the world.'
Number of trials: 10
Embedding dimension: 3072
Pairwise similarities across 45 pairs:
  Min:    0.999997
  Max:    1.000000
  Mean:   0.999999
  Median: 0.999999
  StDev:  0.000001

Average similarity of each run’s vector to the ‘mean vector’: 1.000000
Mean dimension-wise stdev: 0.000015
Max dimension-wise stdev:  0.000071
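For reference, a report like the one above can be produced with something along these lines (a sketch with the openai client and numpy; the statistics are just pairwise cosine similarities over repeated calls):

# Sketch: embed the same text N times and summarize the run-to-run differences.
from itertools import combinations
import numpy as np
from openai import OpenAI

client = OpenAI()
TEXT = "Artificial intelligence is changing the world."
TRIALS = 10

runs = np.stack([
    np.array(client.embeddings.create(model="text-embedding-3-large", input=TEXT).data[0].embedding)
    for _ in range(TRIALS)
])

sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in combinations(runs, 2)]

print(f"Pairwise similarities across {len(sims)} pairs:")
print(f"  Min: {min(sims):.6f}  Max: {max(sims):.6f}  Mean: {np.mean(sims):.6f}")
print(f"Mean dimension-wise stdev: {runs.std(axis=0).mean():.6f}")
print(f"Max dimension-wise stdev:  {runs.std(axis=0).max():.6f}")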

The only way to reduce the noise is to de-dupe at the string level … so the same string gets the same prior embedding.

So basically caching everything.

Helps reduce embedding costs and latencies over time too.

But DB costs go up. (oh well)
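A minimal sketch of that de-dupe layer (hash the exact string and check the cache before calling the API; the in-memory dict stands in for whatever DB or KV store you actually use, and embed() is your existing embedding call):

# Sketch: cache embeddings keyed by a hash of the exact input string.
import hashlib

_cache: dict[str, list[float]] = {}   # stand-in for a real KV store / DB table

def cached_embedding(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)     # embed() = your existing API call (assumed)
    return _cache[key]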

1 Like

Had to do some optimizations on a database lately - something with millions of entries every day… and after I was done the CPU load was reduced to 50%. That was when they found out they had accidentally deployed the database to a staging server with a single core :joy:
I had no shell access… otherwise htop would have been my first step, obviously. So it was all just stored procedures and index optimization and such…

Back to the topic - why people don’t use DB relations in combination with embeddings will always be strange to me.

I think most folks don’t interlink DBs and embeddings because it adds a layer of complexity.

But any good embedding service should have the strings behind the embeddings in a DB and the embedding vectors in memory (need both).

So the embedding search is done in memory for speed, and the DB is leveraged to retrieve the text (strings) at the indices corresponding to the top embeddings.

If you are going through all this work, might as well throw in a cheap call to look up the string hash in the DB before embedding it.

It’s fast, inexpensive, and might save you something like 20-30% off the embedding costs, depending on how repetitive your incoming queries are, of course.

The more repetitive, the more savings! (and lower latencies!)
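Roughly, the two layers look like this (a sketch with numpy; the dict stands in for the DB table that maps hash → string):

# Sketch: vectors and hashes live in memory, the original strings live in the "DB".
import hashlib
import numpy as np

vectors: list[np.ndarray] = []    # memory layer: unit-normalized embedding vectors
hashes: list[str] = []            # memory layer: one hash per vector, row-aligned
db_strings: dict[str, str] = {}   # DB layer stand-in: hash -> original text

def index(text: str, embedding: list[float]) -> None:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    v = np.array(embedding)
    vectors.append(v / np.linalg.norm(v))
    hashes.append(key)
    db_strings[key] = text        # in reality: an INSERT into the database

def search(query_embedding: list[float], top_k: int = 5) -> list[str]:
    q = np.array(query_embedding)
    q = q / np.linalg.norm(q)
    scores = np.stack(vectors) @ q                 # cosine similarity via dot products
    top = np.argsort(scores)[::-1][:top_k]
    return [db_strings[hashes[i]] for i in top]    # DB lookup only for the winning texts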

1 Like

Weaviate be like “Am I nothing to you?”.

Maybe on a pure level it would make sense to separate the two, but interlinking is definitely beneficial when you are switching between and intertwining DB and semantic lookups.

1 Like

The papers and theory behind Weaviate are fun to replicate, but if you have the time and interest, no, you don’t need Weaviate!

2 Likes

You are still interlinking, just through two physical software layers:

embeddings → hash (memory layer)
hash → strings (DB layer)

1 Like

It makes it way easier to label stuff as well

hash → cached rank index results

1 Like

Why I said pure level.

Weaviate does a fantastic job combining the two without sacrificing (too much of) the performance side of things.

Me, personally, I like a single source of truth

1 Like

Yes, the flow is more like:

query → hash → lookup in DB | { get previous | generate new / store }

Then go through search and ranking (memory), then retrieval (DB), then the prompt (LLM), then wait for the LLM response.

2 Likes

Yeah, my version works too. Fits in a little AWS Lambda function, small footprint, easy to understand, fast, etc. Just depends on how you plan on deploying I guess.

Surprised you don’t have your own Rust version out there @RonaldGRuckus :rofl:

1 Like