Do embeddings have temperature?

Hello

I’ve been using embeddings for a while (ada, and now “text-embedding-3-small”) and have noticed that the exact same word produces small variations in the output embedding vector.

I can’t find anything in the docs (https://platform.openai.com/docs/guides/embeddings/?lang=python) about controlling temperature, so I was expecting the default to be temp=0 and identical outputs from the same query.

Here is an example showing the first 5 values of the embedding vectors from two repetitions of the query “friction” against the model “text-embedding-3-small”:

friction (run 1)    friction (run 2)    Diff
-0.009956564        -0.009942042        -1.4522E-05
0.024430862         0.024431106         -2.44E-07
-0.009832289        -0.009817766        -1.4523E-05
-0.02318812         -0.02318835         2.3E-07
0.009817668         0.009832387         -1.4719E-05

As seen in the table, there are slight differences (the Diff column is non-zero). I don’t know how much this matters in practice (it depends on the use case), but it would be nice if it were possible to set the temperature and a seed to get repeatable output.
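Something like the following reproduces the comparison (a minimal sketch with the openai Python client; only the model and query come from above, the harness details are mine):

# Sketch: embed the same word twice and compare the first few dimensions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

run_1 = embed("friction")
run_2 = embed("friction")

# Print the first 5 dimensions and their differences, as in the table above.
for a, b in zip(run_1[:5], run_2[:5]):
    print(f"{a:.9f}  {b:.9f}  diff={a - b:+.4e}")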

Does anyone know if it is possible to control temperature for embeddings?

The embeddings endpoint doesn’t have a softmax, doesn’t generate tokens, and doesn’t sample from token probabilities, so no, it does not have a temperature parameter.

What you’re observing is non-determinism in the AI model itself: subsequent runs show small variations in the tensor-math results. The same is seen across all current OpenAI models - slight variations in outputs between runs.

These variations appear at significant figures far below what affects the quality of similarity-search rankings, so it doesn’t matter too much.

friction_1     friction_2     Diff          Percentage Variance
-0.00995656    -0.00994204    -1.4522e-05    0.145854
0.0244309      0.0244311      -2.44e-07     -0.000998737
-0.00983229    -0.00981777    -1.4523e-05    0.147707
-0.0231881     -0.0231883      2.3e-07      -0.000991887
0.00981767     0.00983239     -1.4719e-05   -0.149924
4 Likes

Hi @Phinder !

Just to add to what @_j wrote: you will actually see variations in chat models (non-embedding ones) even when you set temperature to zero and set a seed value. For example, you will see variations in the logprobs on tokens, and therefore in the output tokens themselves. As Jay mentioned, there is a massive amount of infra on the OpenAI side, and all the slight precision differences in the calculations add up!
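To see it yourself, something like this works (a minimal sketch with the openai Python client; the model name and prompt here are arbitrary placeholders):

# Sketch: run the same chat request twice with temperature=0 and a fixed seed,
# then compare the token logprobs between runs.
from openai import OpenAI

client = OpenAI()

def token_logprobs(prompt: str) -> list[float]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder: any chat model that returns logprobs
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=1234,
        logprobs=True,
        max_tokens=8,
    )
    return [t.logprob for t in response.choices[0].logprobs.content]

run_1 = token_logprobs("Name one source of friction.")
run_2 = token_logprobs("Name one source of friction.")

# Even with temperature=0 and a fixed seed, these often differ slightly.
for a, b in zip(run_1, run_2):
    print(f"{a:.6f}  {b:.6f}  diff={a - b:+.2e}")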

1 Like

Thanks for the inputs!

As an example, here are logits from a local model (Anthropic BSides CFT model). I’m sure machine-precision errors creep in eventually, but they are not normally visible when inspecting the logits: at least the first 5 digits of the output logits are always the same across consecutive queries.

Run 1:
tensor([[[-7.5829e-02, 1.3152e-02, -5.3292e-02, -5.5444e-02, 2.0464e-02,

Run 2:
tensor([[[-7.5829e-02, 1.3152e-02, -5.3292e-02, -5.5444e-02, 2.0464e-02,

So for the embeddings it is much more noticeable. But I guess I will have to live with it; I’m just not used to seeing that kind of variation…
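For reference, the local check is roughly this (a sketch using Hugging Face transformers; “gpt2” is only a placeholder model for illustration, not the one actually used above):

# Sketch: run the same forward pass twice on a local model and compare logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("friction", return_tensors="pt")

with torch.no_grad():
    logits_1 = model(**inputs).logits
    logits_2 = model(**inputs).logits

print(logits_1[0, -1, :5])                         # first 5 logits, run 1
print(logits_2[0, -1, :5])                         # first 5 logits, run 2
print(torch.max(torch.abs(logits_1 - logits_2)))   # typically exactly 0 on the same hardware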

1 Like

Local models will be much more stable in this regard, because they execute as a single instance on the same hardware. The variability comes from a massive infrastructure consisting of many instances of these models spread over thousands of GPUs, where slight deviations in precision add up.

I wrote a bit more about this variability here.

2 Likes

GPT-3 models and their embeddings counterparts had no such behavior: they would always return identical logprobs for identical inputs right up until shutoff, while running alongside the chat models that did show the symptom.

Therefore, the non-repeatability cannot be attributed simply to datacenter deployment, or to the “fingerprint” field they return with little utility. OpenAI has never answered what the cause is: whether it is an effect of architecture or optimization, or whether they were deliberately making outputs fuzzy to reduce discoverability of the technology (before the ability to reveal embedding size and underlying model parameter count was fully published).

3 Likes

It’s caused by the asynchronous timing in the GPUs. IEEE floating-point arithmetic is neither associative nor distributive, so the order in which a parallel reduction accumulates terms changes the last bits of the result. The curse of massive parallelism. But the benefit, massive speed, outweighs any non-determinism, since it’s in the noise anyway.
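A quick way to see the order-dependence in plain Python (no GPU needed; the numbers are just chosen to make the rounding obvious):

# Floating-point addition is not associative: summation order changes the result.
a, b, c = 0.1, 1e16, -1e16

print((a + b) + c)   # 0.0  (0.1 is lost when added to the huge value first)
print(a + (b + c))   # 0.1

# A parallel reduction that groups terms differently between runs therefore
# produces slightly different sums, which is where the run-to-run noise lives.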

2 Likes

You can always do multiple requests and average them to make it more deterministic though…

Or add them all together for a wider representation of the word. Which may or may not be a good thing (because it could create relations in one part of the space but not in other parts for the same word)…
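A sketch of what the averaging would look like (numpy; embed() stands in for whatever single-embedding call you already make):

# Sketch: average several embeddings of the same text and re-normalize.
import numpy as np

def averaged_embedding(text: str, n: int = 5) -> np.ndarray:
    vectors = np.array([embed(text) for _ in range(n)])   # embed() is your existing call (assumed)
    mean = vectors.mean(axis=0)
    return mean / np.linalg.norm(mean)                    # keep unit length for cosine similarity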

You really don’t need to make multiple calls to counter this symptom. Use your multiple calls for other semantic-search techniques.

=== Non-Determinism Report for model=text-embedding-3-large ===
Input text: 'Artificial intelligence is changing the world.'
Number of trials: 10
Embedding dimension: 3072
Pairwise similarities across 45 pairs:
  Min:    0.999997
  Max:    1.000000
  Mean:   0.999999
  Median: 0.999999
  StDev:  0.000001

Average similarity of each run’s vector to the ‘mean vector’: 1.000000
Mean dimension-wise stdev: 0.000015
Max dimension-wise stdev:  0.000071
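For reference, a report like the one above can be produced with something along these lines (a sketch with the openai client and numpy; the statistics are just pairwise cosine similarities over repeated calls):

# Sketch: embed the same text N times and summarize the run-to-run differences.
from itertools import combinations
import numpy as np
from openai import OpenAI

client = OpenAI()
TEXT = "Artificial intelligence is changing the world."
TRIALS = 10

runs = np.stack([
    np.array(client.embeddings.create(model="text-embedding-3-large", input=TEXT).data[0].embedding)
    for _ in range(TRIALS)
])

sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in combinations(runs, 2)]

print(f"Pairwise similarities across {len(sims)} pairs:")
print(f"  Min: {min(sims):.6f}  Max: {max(sims):.6f}  Mean: {np.mean(sims):.6f}")
print(f"Mean dimension-wise stdev: {runs.std(axis=0).mean():.6f}")
print(f"Max dimension-wise stdev:  {runs.std(axis=0).max():.6f}")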

The only way to reduce the noise is to de-dupe at the string level … so the same string gets the same prior embedding.

So basically caching everything.

Helps reduce embedding costs and latencies over time too.

But DB costs go up. (oh well)
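A minimal sketch of that de-dupe layer (hash the exact string and check the cache before calling the API; the in-memory dict stands in for whatever DB or KV store you actually use, and embed() is your existing embedding call):

# Sketch: cache embeddings keyed by a hash of the exact input string.
import hashlib

_cache: dict[str, list[float]] = {}   # stand-in for a real KV store / DB table

def cached_embedding(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)     # embed() = your existing API call (assumed)
    return _cache[key]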

1 Like

Had to do some optimizations on a database lately - something with millions of entries every day… and after I was done the CPU load was reduced to 50%. That was when they found out they had accidentally deployed the database to a staging server with a single core :joy:
I had no shell access… otherwise htop would have been my first step, obviously. So it was all just stored procedures and index optimization and such…

Back to the topic - why people don’t use DB relations in combination with embeddings will always be strange to me.

I think most folks don’t interlink DBs and embeddings because it adds a layer of complexity.

But any good embedding service should have the strings behind the embeddings in a DB and the embedding vectors in memory (need both).

So the embedding search is done in memory for speed, and the DB is leveraged to retrieve the text (strings) at the indices corresponding to the top embeddings.

If you are going through all this work, might as well throw in a cheap call to look up the string hash in the DB before embedding it.

It’s fast, inexpensive, and might save you something like 20-30% off the embedding costs, depending on how repetitive your incoming queries are, of course.

The more repetitive, the more savings! (and lower latencies!)
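Roughly, the two layers look like this (a sketch with numpy; the dict stands in for the DB table that maps hash → string):

# Sketch: vectors and hashes live in memory, the original strings live in the "DB".
import hashlib
import numpy as np

vectors: list[np.ndarray] = []    # memory layer: unit-normalized embedding vectors
hashes: list[str] = []            # memory layer: one hash per vector, row-aligned
db_strings: dict[str, str] = {}   # DB layer stand-in: hash -> original text

def index(text: str, embedding: list[float]) -> None:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    v = np.array(embedding)
    vectors.append(v / np.linalg.norm(v))
    hashes.append(key)
    db_strings[key] = text        # in reality: an INSERT into the database

def search(query_embedding: list[float], top_k: int = 5) -> list[str]:
    q = np.array(query_embedding)
    q = q / np.linalg.norm(q)
    scores = np.stack(vectors) @ q                 # cosine similarity via dot products
    top = np.argsort(scores)[::-1][:top_k]
    return [db_strings[hashes[i]] for i in top]    # DB lookup only for the winning texts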

1 Like

Weaviate be like “Am I nothing to you?”.

Maybe on a pure level it would make sense to separate the two, but interlinking is definitely beneficial when you are switching between and intertwining DB and semantic lookups.

1 Like

The papers and theory behind Weaviate are fun to replicate, but if you have the time and interest, no, you don’t need Weaviate!

2 Likes

You are still interlinking, just through two physical software layers:

embeddings → hash (memory layer)
hash → strings (DB layer)

1 Like

It makes it way easier to label stuff as well

hash → cached rank index results

1 Like

Why I said pure level.

Weaviate does a fantastic job combining the two without sacrificing (too much of) the performance side of things.

Me, personally, I like a single source of truth

1 Like

Yes, the flow is more like:

query → hash → lookup in DB | { get previous | generate new / store }

Then go through search and ranking (memory), then retrieval (DB), then the prompt (LLM), then wait for the LLM response.

2 Likes

Yeah, my version works too. Fits in a little AWS Lambda function, small footprint, easy to understand, fast, etc. Just depends on how you plan on deploying I guess.

Surprised you don’t have your own Rust version out there @RonaldGRuckus :rofl:

1 Like