Does OpenAI Embedding return different vectors for the same text input?
For the same text input, the embeddings endpoint seems to return, at random, one vector out of a specific set of vectors. Why is this happening? If this is not the correct behavior, could it be due to inconsistent information across nodes in the server cluster? Is this a bug or by design?
The test is linked below:
OpenAI.API.http/OpenAI.http at master · AwesomeYuer/OpenAI.API.http (github.com)
Embeddings only return vectors. The vector is the same for the same input, the same model, and the same API endpoint. But we have seen differences between the OpenAI endpoint and the Azure endpoint for the same model. So pick an endpoint and stick with it to avoid any differences.
There could be very slight roundoff errors in the embedding when calling it over and over with the same configuration as above, but this is in the noise and won’t affect your search results.
Azure OpenAI Service does not have this issue!
I’ve found this to happen on particular strings of input. For example, “hello world” will consistently output the same vector but “Hello” will not. You can try this yourself:
import openai  # legacy pre-1.0 interface (openai.Embedding, engine=...)

# Embed the same string twice and compare the raw vectors.
r1 = openai.Embedding.create(input=["Hello"], engine="text-embedding-ada-002")
v1 = r1["data"][0]["embedding"]
r2 = openai.Embedding.create(input=["Hello"], engine="text-embedding-ada-002")
v2 = r2["data"][0]["embedding"]
v1 == v2 # False!
v1[0], v2[0]
# v1[0] = -0.021834855899214745
# v2[0] = -0.021884862333536148
The difference is more than a rounding error in my eyes, and when I re-run this test I get the same v1[0] but a different v2[0]
…
(-0.021834855899214745, -0.021855857223272324)
(-0.021834855899214745, -0.0218560378998518)
Sometimes this will even output the same vector, so v1 == v2.
This is not a rounding error on the client side. The JSON returned by the OpenAI API is different for the same input string. So it does verifiably return different data for the same input (sometimes). The distance between the variations in the returned data is so small that I can’t imagine this is an issue. But it would be nice to know why this is happening.
They do. I found this today: some documents were moving in and out of the cosine threshold.
I embedded “hello world” 10 times in a row and the output is not always the same.
From what I’ve seen, the difference in any given coefficient is less than 1e-4. For example, here is the first coefficient when embedding “hello world” with “text-embedding-ada-002”.
The requests go through httpx: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
(using the async client in openai 1.2.3)
[-0.0149070480838418,
-0.0149070480838418,
-0.0149070480838418,
-0.0149070480838418,
-0.0149070480838418,
-0.0149070480838418,
-0.0149070480838418,
-0.0149070480838418,
-0.014956329017877579,
-0.0149070480838418]
It’s actually worse than that, e.g. another run:
[-0.0149070480838418,
-0.014948742464184761,
-0.014948742464184761,
-0.0149070480838418,
-0.014956329017877579,
-0.014948742464184761,
-0.0149070480838418,
-0.0149070480838418,
-0.01559481117874384,
-0.0149070480838418]
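To put a number on "worse than that", here is a minimal sketch that measures the largest deviation in the second run. The values are copied from the list above; treating the most frequent value as the reference is an assumption made for illustration, not a claim about which vector is "correct".

```python
# First coefficients copied from the second run listed above.
run = [
    -0.0149070480838418,
    -0.014948742464184761,
    -0.014948742464184761,
    -0.0149070480838418,
    -0.014956329017877579,
    -0.014948742464184761,
    -0.0149070480838418,
    -0.0149070480838418,
    -0.01559481117874384,
    -0.0149070480838418,
]

def max_abs_deviation(values, reference):
    """Largest absolute difference between any value and the reference."""
    return max(abs(v - reference) for v in values)

# Assumption: use the most frequent value as the reference point.
reference = max(set(run), key=run.count)
worst = max_abs_deviation(run, reference)
print(f"reference={reference}, worst deviation={worst:.2e}")
```

For this run the worst deviation lands between 1e-4 and 1e-3, so the "less than 1e-4" bound from the first run does not hold here.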
How much do your vector dot products vary? Say take one of those vectors, fix it as a reference, and dot the other vectors with it. What is this variation?
Some variation is expected because of the random timing in the GPUs and because floating-point arithmetic is not associative. They are also likely taking the last hidden layer and scaling it out to the unit hypersphere, which would magnify the error for hidden states close to the origin.
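Both effects can be reproduced on a small scale. The sketch below is a toy illustration, not a model of what the embedding service actually does: it shows that the grouping of additions changes a floating-point sum, and that the same absolute perturbation moves a normalized vector much further when the original vector is close to the origin. The helper names and the 2-D vectors are made up for the example.

```python
import math

# Floating-point addition is not associative: the grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def shift_after_normalizing(v, eps):
    """Distance between normalize(v) and normalize(v with eps added to v[0])."""
    perturbed = [v[0] + eps] + list(v[1:])
    return math.dist(normalize(v), normalize(perturbed))

eps = 1e-5  # same absolute error applied to both vectors
near_origin = shift_after_normalizing([1e-3, 2e-3], eps)
far_from_origin = shift_after_normalizing([1.0, 2.0], eps)
print(near_origin > far_from_origin)
```

If the reduction order on the GPU differs between requests, small differences of the first kind can appear before normalization, and the second effect determines how visible they are in the returned unit vector.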
If you need consistency, one option is to create a hash of the text to be embedded and check it against the hashes you have already stored before creating a new embedding.
This way you prevent the same text from being embedded twice, and you don’t have to worry about differences in the results.
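A minimal sketch of that idea, assuming an in-memory dictionary as the store. `EmbeddingCache` and `fake_embed` are hypothetical names; `fake_embed` is a deliberately noisy stand-in for the real API call so the example runs without network access.

```python
import hashlib
import itertools

class EmbeddingCache:
    """Return the stored vector for text already seen; embed only new text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn   # callable: str -> list of floats
        self._vectors = {}          # sha256 hex digest -> stored vector

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._vectors:
            self._vectors[key] = self._embed_fn(text)
        return self._vectors[key]

# Hypothetical noisy embedder: returns a slightly different value on each call.
_calls = itertools.count()

def fake_embed(text):
    return [len(text) + next(_calls) * 1e-5]

cache = EmbeddingCache(fake_embed)
first = cache.get("hello")
second = cache.get("hello")
print(first == second)  # the second lookup never reaches the embedder
```

In practice the dictionary would be replaced by whatever database already holds the vectors. Note that this only guarantees that the same stored vector is reused; it says nothing about which of the possible vectors was stored first.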
This still seems to be an issue:
I’ve raised an issue in the python CLI and pointed to the developer forum.
>>> from openai import OpenAI
>>> client = OpenAI(api_key="...")
>>> for i in range(50):
... client.embeddings.create(model="text-embedding-ada-002", input="hello").data[0].embedding[:5]
...
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.024994507431983948, -0.019366780295968056, -0.027768738567829132, -0.031097816303372383, -0.02462460845708847]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.0250103659927845, -0.01939525455236435, -0.027798103168606758, -0.030995413661003113, -0.024706488475203514]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.0250103659927845, -0.01939525455236435, -0.027798103168606758, -0.030995413661003113, -0.024706488475203514]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025048330426216125, -0.019377758726477623, -0.027810918167233467, -0.0310361385345459, -0.02466500550508499]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.02502652257680893, -0.019331468269228935, -0.027801373973488808, -0.031051915138959885, -0.02469618245959282]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
I’m also able to replicate this using the https endpoint.
What happens when you try to run a distance function on all of these slightly different vectors?
Sure,
>>> import numpy as np
>>> from openai import OpenAI
>>> client = OpenAI(api_key="...")
>>> embeddings = client.embeddings.create(model="text-embedding-ada-002", input=["hello" for _ in range(50)])
>>> for x in embeddings.data:
... print(x.embedding[:5])
...
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
...
[-0.0250103659927845, -0.01939525455236435, -0.027798103168606758, -0.030995413661003113, -0.024706488475203514]
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
...
[-0.025058425962924957, -0.019388560205698013, -0.027781018987298012, -0.030979404225945473, -0.024688364937901497]
>>> base_emb = np.array(embeddings.data[0].embedding, dtype="float64")
>>> diff_emb = np.array(embeddings.data[36].embedding, dtype="float64")
>>> print(base_emb[:5])
[-0.02505843 -0.01938856 -0.02778102 -0.0309794 -0.02468836]
>>> print(diff_emb[:5])
[-0.02501037 -0.01939525 -0.0277981 -0.03099541 -0.02470649]
>>> dist = np.linalg.norm(base_emb-diff_emb)
>>> dist
0.0015302552960605469
That is not a “small” level of inaccuracy.
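To compare this Euclidean distance with the dot products used elsewhere in the thread, the two can be converted into each other when both vectors have unit length, since then ||a - b||² = 2 - 2(a · b). The sketch below assumes the ada-002 vectors are unit-normalized; that is an assumption here, and it is worth checking with `np.linalg.norm` on your own vectors.

```python
def cosine_from_unit_distance(dist):
    """Cosine similarity of two unit vectors whose Euclidean distance is dist."""
    return 1.0 - (dist * dist) / 2.0

dist = 0.0015302552960605469  # distance reported above
cos_sim = cosine_from_unit_distance(dist)
print(cos_sim)
```

Under that assumption, a distance of about 0.0015 corresponds to a cosine similarity slightly below 0.999999.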
from openai import OpenAI
import numpy as np
import os
client = OpenAI(api_key=os.environ['OPENAI_KEY'])
lolz = [
client.embeddings.create(
model="text-embedding-ada-002",
input="hello"
).data[0].embedding for i in range(2)
]
np.dot(lolz[0], lolz[1])
0.999999971277652
Are you expecting a 1?
lolz = [
client.embeddings.create(
model="text-embedding-ada-002",
input="hello %s" % i
).data[0].embedding for i in range(2)
]
np.dot(lolz[0], lolz[1])
0.9038732013151791
I’d say that .9999997 is close enough
lolz = [
client.embeddings.create(
model="text-embedding-ada-002",
input="hello" + (" " * (i))
).data[0].embedding for i in range(2)
]
np.dot(lolz[0], lolz[1])
0.9119886859504618
i went a little further
lolz = [
client.embeddings.create(
model="text-embedding-ada-002",
input="hello"
).data[0].embedding for i in range(50)
]
olawd = [
np.dot(lolz[x], lolz[y])
for x in range(len(lolz)) for y in range(len(lolz))
]
sum(olawd) / len(olawd)
0.9999996662160927
As you can see from my specific examples, roughly 1/10 embeddings are way off for the exact same input “hello”.
Those examples with the same input and different embeddings have a large distance between the vectors.
These would lead to inaccuracies in ordering and downstream models, it’s a reasonable concern.
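One rough way to test this claim is to report the worst pairwise similarity alongside the mean, because a mean taken over mostly identical vectors hides a few outliers. The sketch below uses made-up 2-D toy vectors (nine identical, one slightly off), not real embeddings; `similarity_spread` is a hypothetical helper name.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_spread(vectors):
    """Return (worst, mean) cosine similarity over all distinct pairs."""
    sims = [cosine(a, b) for a, b in combinations(vectors, 2)]
    return min(sims), sum(sims) / len(sims)

# Toy data: nine identical vectors plus one outlier (illustrative only).
toy = [[1.0, 0.0]] * 9 + [[0.999, 0.045]]
worst, mean = similarity_spread(toy)
print(f"worst={worst:.6f}, mean={mean:.6f}")
```

Running the same function on 50 real embeddings of one string would show whether the worst pair is close enough to the nearest genuinely different document to change a ranking.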
If "Hello" against itself has a dot product of ~0.99999967 and the closest (maybe) fathomable variant, "Hello ", is ~0.91198869, what exactly is the problem? Your example is nit-picking numbers without meaning. See: not being pragmatic.
If you need it to be perfect then hash the strings and look it up that way first.
Strange. Everyone else uses embeddings just fine. They are inherently fuzzy.
If a string is hashed and the stored embedding is the incorrect one, you’d be looking up the incorrect one via your hash? How do you know your first stored hash is correct unless you embed multiple times which would be a hack of a solution.
The context here: items are embedded according to their text and stored. For sorting and search you’d expect these embeddings to be deterministic, given that embeddings are supposed to come from a frozen layer of a particular LLM, and this is only an issue with ada-002.
I also disagree that everyone else uses embeddings just fine: there are a bunch of threads on OpenAI’s forum about the same issue, issues raised on libraries, multiple Stack Overflow threads, and even articles on it. I’m limited in how many links I can post, otherwise I’d publish them here.
This one article explains it quite well: OpenAI or Azure OpenAI: Can models be more deterministic depending on API?
There is no “incorrect one” when 9/10 are different…
As I show, none of the more than a dozen GPT-3 embedding models has this problem.
I don’t see the logic here. Going back to being pragmatic: If the difference is on a scale of < 0.000001 then why would it be incorrect? What are you actually trying to accomplish with your embeddings?
The article states that Azure always returns the same results, so maybe that’s a better solution?
Yes. Many people notice that the embeddings are slightly different and think that it’s going to cause issues. Yet 95% of the world continues working with it just fine. I don’t see the logic here either.
But I don’t think this will go anywhere. I’m sensing a complete lack of pragmatism, so let’s hold hands and solve this together. Have you ever looked closer at the vector values and counted how many digits they have?
0.025058425962924957
Now call Azure, or even call the OpenAI endpoint using curl, and then count how many digits they have as well
0.025058426
Seeing something?
Yes. The Python library returns extra digits. So you are safe to either round it yourself, or just deal with the inherent noise that really doesn’t make any difference.
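Rounding can be done by converting each coefficient to single precision, which is a sketch of "round it yourself" using only the standard library. `to_float32` is a hypothetical helper. The two sample values are the `v1[0]` and `v2[0]` reported earlier in the thread; they differ by about 5e-5, far more than float32 precision at this magnitude, so rounding trims the extra digits but does not make those two runs equal.

```python
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

v1 = -0.021834855899214745  # v1[0] from the earlier post
v2 = -0.021884862333536148  # v2[0] from the earlier post

print(to_float32(v1) == to_float32(v2))  # still different after rounding
print(abs(v1 - v2))                      # size of the run-to-run difference
```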