Hi all, I am getting different embeddings for the same text when I request it more than once.
I am using the following script:
import time

import numpy as np
import openai

openai.api_key = ...
model = 'text-embedding-ada-002'


def test():
    def get_openai_embeddings(texts, model):
        result = openai.Embedding.create(
            model=model,
            input=texts,
        )
        return result

    texts = [
        "The Lake Street Transfer station was a rapid transit station on the Chicago \"L\" that linked its Lake Street Elevated with the Logan Square branch of its Metropolitan West Side Elevated Railroad from 1913 to 1951.",
        "The Lake Street and Metropolitan were both constructed in the 1890s by different companies. The two companies owning the lines, along with two others, unified their operations in the early 1910s; as part of the merger, the Lake Street's owner had to close its nearby station on Wood Street and build a new one to form a transfer with the Metropolitan.",
        "This transfer station had a double-decked construction (depicted), with the Metropolitan's infrastructure crossing over the Lake Street. This arrangement continued until the Dearborn Street subway opened on February 25, 1951, replacing the Logan Square branch in the area and leading to the station's closure. The site would eventually serve as the junction of the modern Pink Line to the Green Line.",
    ]
    for text in texts:
        print(text)
        # Embed the same text twice and compare the two responses.
        a = get_openai_embeddings(texts=text, model=model)
        b = get_openai_embeddings(texts=text, model=model)
        a_e = np.array(a['data'][0]['embedding'])
        b_e = np.array(b['data'][0]['embedding'])
        print('Rounded vectors to 5 decimals equal', a_e.round(5) == b_e.round(5))
        print('Max elementwise ratio', (a_e / b_e).max())
        print('Min elementwise ratio', (a_e / b_e).min())
        # L2 norm of the difference, relative to the L2 norm of the first vector.
        print('normalized norm', ((a_e - b_e)**2).sum()**0.5 / (a_e**2).sum()**0.5)
        print()
        time.sleep(1)


if __name__ == '__main__':
    test()
If I run it a few times I get:
The Lake Street Transfer station was a rapid transit station on the Chicago "L" that linked its Lake Street Elevated with the Logan Square branch of its Metropolitan West Side Elevated Railroad from 1913 to 1951.
Rounded vectors to 5 decimals equal [False False True ... False False False]
Max elementwise ratio 2.9404772399566323
Min elementwise ratio -7.386667323598624
normalized norm 0.0020847767758131125
The Lake Street and Metropolitan were both constructed in the 1890s by different companies. The two companies owning the lines, along with two others, unified their operations in the early 1910s; as part of the merger, the Lake Street's owner had to close its nearby station on Wood Street and build a new one to form a transfer with the Metropolitan.
Rounded vectors to 5 decimals equal [ True True True ... True True True]
Max elementwise ratio 1.0
Min elementwise ratio 1.0
normalized norm 0.0
This transfer station had a double-decked construction (depicted), with the Metropolitan's infrastructure crossing over the Lake Street. This arrangement continued until the Dearborn Street subway opened on February 25, 1951, replacing the Logan Square branch in the area and leading to the station's closure. The site would eventually serve as the junction of the modern Pink Line to the Green Line.
Rounded vectors to 5 decimals equal [False False False ... False False False]
Max elementwise ratio 5.401572035423635
Min elementwise ratio -13.42049661181725
normalized norm 0.004099269126226605
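It seems the returned vectors can sometimes differ noticeably between two calls with identical input! I understand that a certain amount of nondeterminism is possible, but some individual elements differ by a large factor or even change sign (see the max/min elementwise ratios above), even though the normalized norm of the difference is small.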
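
To put those numbers in context, here is a minimal sketch of a whole-vector comparison, which I suspect is more meaningful than elementwise ratios (a ratio blows up whenever an element is close to zero). It uses synthetic vectors, not real API output: the 1536 dimensions and the noise scale of 1e-4 are assumptions chosen so that the relative distance lands near the normalized norm values printed above, and `compare_embeddings` is a hypothetical helper of mine, not part of the openai library.

```python
import numpy as np


def compare_embeddings(a, b):
    """Summarize how far apart two embedding vectors are (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a - b
    return {
        # Cosine of the angle between the two vectors.
        "cosine_similarity": float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))),
        # Same quantity as the 'normalized norm' line in the script above.
        "relative_l2_distance": float(np.linalg.norm(diff) / np.linalg.norm(a)),
        # Largest absolute difference in any single element.
        "max_abs_difference": float(np.abs(diff).max()),
    }


# Synthetic stand-ins for two responses (assumption: NOT real embeddings).
rng = np.random.default_rng(0)
a_e = rng.normal(size=1536)
a_e /= np.linalg.norm(a_e)                      # unit-length vector
b_e = a_e + rng.normal(scale=1e-4, size=1536)   # small perturbation of a_e

stats = compare_embeddings(a_e, b_e)
for name, value in stats.items():
    print(name, value)
```

If the real responses behave like this, the cosine similarity stays extremely close to 1 even when a few near-zero elements show large or negative ratios, so for similarity search the difference may not matter much. I would still like to know where the variation comes from.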