Non-deterministic embedding results using text-embedding-ada-002

Hi all, I am getting different embeddings for the same texts.

Using the following script:

import time

import numpy as np
import openai

openai.api_key = ...

model = 'text-embedding-ada-002'


def test():
    def get_openai_embeddings(texts, model):
        result = openai.Embedding.create(
            model=model,
            input=texts,
        )
        return result

    texts = [
        "The Lake Street Transfer station was a rapid transit station on the Chicago \"L\" that linked its Lake Street Elevated with the Logan Square branch of its Metropolitan West Side Elevated Railroad from 1913 to 1951.",
        "The Lake Street and Metropolitan were both constructed in the 1890s by different companies. The two companies owning the lines, along with two others, unified their operations in the early 1910s; as part of the merger, the Lake Street's owner had to close its nearby station on Wood Street and build a new one to form a transfer with the Metropolitan.",
        "This transfer station had a double-decked construction (depicted), with the Metropolitan's infrastructure crossing over the Lake Street. This arrangement continued until the Dearborn Street subway opened on February 25, 1951, replacing the Logan Square branch in the area and leading to the station's closure. The site would eventually serve as the junction of the modern Pink Line to the Green Line.",
    ]

    for text in texts:
        print(text)
        # Embed the same text twice; the two results should match if the API is deterministic.
        a = get_openai_embeddings(texts=text, model=model)
        b = get_openai_embeddings(texts=text, model=model)
        a_e = np.array(a['data'][0]['embedding'])
        b_e = np.array(b['data'][0]['embedding'])
        print('Rounded vectors to 5 decimals equal', a_e.round(5) == b_e.round(5))
        print('Max elementwise ratio', (a_e / b_e).max())
        print('Min elementwise ratio', (a_e / b_e).min())
        print('normalized norm', ((a_e - b_e)**2).sum()**0.5 / (a_e**2).sum()**0.5)
        print()
        time.sleep(1)


if __name__ == '__main__':
    test()

If I run it a few times I get:

The Lake Street Transfer station was a rapid transit station on the Chicago "L" that linked its Lake Street Elevated with the Logan Square branch of its Metropolitan West Side Elevated Railroad from 1913 to 1951.
Rounded vectors to 5 decimals equal [False False  True ... False False False]
Max elementwise ratio 2.9404772399566323
Min elementwise ratio -7.386667323598624
normalized norm 0.0020847767758131125

The Lake Street and Metropolitan were both constructed in the 1890s by different companies. The two companies owning the lines, along with two others, unified their operations in the early 1910s; as part of the merger, the Lake Street's owner had to close its nearby station on Wood Street and build a new one to form a transfer with the Metropolitan.
Rounded vectors to 5 decimals equal [ True  True  True ...  True  True  True]
Max elementwise ratio 1.0
Min elementwise ratio 1.0
normalized norm 0.0

This transfer station had a double-decked construction (depicted), with the Metropolitan's infrastructure crossing over the Lake Street. This arrangement continued until the Dearborn Street subway opened on February 25, 1951, replacing the Logan Square branch in the area and leading to the station's closure. The site would eventually serve as the junction of the modern Pink Line to the Green Line.
Rounded vectors to 5 decimals equal [False False False ... False False False]
Max elementwise ratio 5.401572035423635
Min elementwise ratio -13.42049661181725
normalized norm 0.004099269126226605

It seems the vectors returned can sometimes differ quite a lot between calls! I understand that a certain amount of stochasticity is possible, but some elements are very different.


Hi @thiboeri

I ran your text (the first example) 10 times and got the same embedded vector 10 times, using a Ruby API wrapper (not Python).

HTH

:slight_smile:

Yes. The OpenAI Python library returns many more decimal places, which are essentially noise. I believe it has something to do with the conversion from base64? I can't remember exactly.

Regardless, the similarity scores will be the same.
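To see why the extra decimal places don't matter for similarity, here's a minimal sketch (using a random stand-in for a real embedding; ada-002 vectors have 1536 dimensions and are unit-normalized) comparing a vector against a copy rounded to 5 decimals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an embedding: 1536-dim, unit norm.
v = rng.normal(size=1536)
v /= np.linalg.norm(v)

# The same vector with everything past 5 decimal places treated as noise.
w = v.round(5)
w /= np.linalg.norm(w)

# Cosine similarity of two unit vectors is just their dot product.
cos = float(v @ w)
print(cos)  # extremely close to 1.0
```

The per-component rounding error is at most 5e-6, so the cosine similarity is indistinguishable from 1 for any practical ranking purpose.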

We just faced the same issues for the first time here when using the openai-python package.

We ran some tests, and around 11% of the embeddings were considerably different at the decimal level, even though they were close in the vector space.

UPDATE: For anyone facing this issue, the embeddings endpoint is deterministic. The difference is caused by the OpenAI Python package, which uses base64 as its default encoding format while other clients don't.
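I can't confirm the exact wire format, but the base64 round trip the Python package does can be sketched roughly like this (the values below are made up for illustration): the server serializes the embedding as float32 bytes, base64-encodes them, and the client decodes them back into floats, so anything beyond float32's ~7 significant digits is noise.

```python
import base64

import numpy as np

# Made-up values standing in for a few embedding components.
original = np.array([-0.026714837, 0.012345678, 0.098765432], dtype=np.float64)

# "Server side": serialize as float32 and base64-encode.
payload = base64.b64encode(original.astype(np.float32).tobytes()).decode()

# "Client side": decode base64 back into a float array.
decoded = np.frombuffer(base64.b64decode(payload), dtype=np.float32)

print(decoded)  # agrees with the original to float32 precision only
```

So two clients that pick different encoding formats can print differently past the 7th significant digit while representing the same underlying vector.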


See this post for more discussion on embedding decimal places: Discrepancy in embeddings precision - #8 by curt.kennedy


Hi, I faced the same issue with Python here. :smiling_face_with_tear: In my dataset, I'm getting only 2 decimal places of precision.

Did you fix the problem?
I've read all the instructions and guides but still don't know what to do.

Is not using Python the only option? :smiling_face_with_tear:

I think it's worth separating the Python precision issue discussed in Discrepancy in embeddings precision - #7 by RonaldGRuckus from the issue of the OpenAI API returning slightly different embeddings for the exact same input. Instead of using Python, let's use curl directly:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "input": "<TEXT>",
    "model": "text-embedding-ada-002"
  }' | jq '.data[0].embedding[0]'

Make sure to set your OPENAI_API_KEY. The jq command will return the 1st number from the embedding; if you run this a couple of times, you will see that the numbers can be slightly different. In my case, for example: -0.026714837 and -0.026664866 (a difference of about 5e-05).
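To connect this back to the numbers in the original post: assuming both embeddings are unit-normalized (which ada-002's are), a normalized difference norm d corresponds to a cosine similarity of 1 - d^2/2. Plugging in the norm reported for the first text above shows how close even the "different" runs are:

```python
# For unit vectors a and b: ||a - b||^2 = 2 - 2*cos(a, b),
# so cos(a, b) = 1 - ||a - b||^2 / 2.
d = 0.0020847767758131125  # normalized difference norm from the first post
cosine = 1 - d**2 / 2
print(cosine)  # about 0.9999978
```

So while the raw components look noisy, the two vectors point in almost exactly the same direction.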
