I am using the boilerplate code below to get embeddings from different models:
import openai
from scipy.spatial.distance import cosine

def get_embedding(text, model="text-embedding-ada-002", api_key: str = mykey):
    openai.api_key = api_key
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
and then I just run some simple "tests", such as:
model = 'text-embedding-ada-002'
comp_x = ['bull', 'bullish', 'love', 'i love apple', 'rise', 'positive',
          'overall attitude of investors towards financial market is extremely positive']
comp_y = ['bear', 'bearish', 'hate', 'i hate apple', 'fall', 'negative',
          'overall attitude of investors towards financial market is extremely negative']
for x, y in zip(comp_x, comp_y):
    l_x = get_embedding(x, model=model)
    l_y = get_embedding(y, model=model)
    print(f'{x} vs {y}\nSimilarity: {1 - cosine(l_x, l_y)}')
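For reference, `1 - cosine(l_x, l_y)` (SciPy's `cosine` is a distance) is just plain cosine similarity. Here is a minimal stdlib-only sketch of that metric with made-up toy vectors, not real embeddings, in case anyone wants to sanity-check the measure itself:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|) -- same quantity as 1 - scipy cosine distance
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

(Note that ada-002 embeddings are reportedly normalized to unit length, in which case the similarity reduces to a plain dot product.)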
Here are the results I see:
bull vs bear
Similarity: 0.8770496452605026
bullish vs bearish
Similarity: 0.9221341441559032
love vs hate
Similarity: 0.8440677043256933
i love apple vs i hate apple
Similarity: 0.912365899889429
rise vs fall
Similarity: 0.859895846856977
positive vs negative
Similarity: 0.9312248550781324
overall attitude of investors towards financial market is extremely positive vs overall attitude of investors towards financial market is extremely negative
Similarity: 0.9420619054524753
I have to say, I am surprised by these simple tests. From reading the (limited) docs, my impression was that ada-002 does "contextual" ML (a transformer-based neural network, etc.) with a lot of quite advanced bells and whistles.
However, the results above seem to indicate that ada-002 is doing more of a "syntactic" match. The fact that "love" vs "hate" scores 0.844, yet "i love apple" vs "i hate apple" scores 0.912, strikes me as a telling sign.
In the last example, the sentence is actually long and the ONLY meaningful distinction is the word "positive" vs "negative", yet the similarity is a whopping 0.942!
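One way to quantify this hunch about surface-level matching: a crude token-overlap (Jaccard) score, which is emphatically NOT what ada-002 computes, already shows the two long sentences share far more tokens than the short phrases do, so any similarity measure sensitive to shared words would rate them higher:

```python
def token_jaccard(a: str, b: str) -> float:
    # Crude surface-level overlap: shared tokens / all distinct tokens.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

print(token_jaccard("i love apple", "i hate apple"))  # 2 shared / 4 total -> 0.5
print(token_jaccard(
    "overall attitude of investors towards financial market is extremely positive",
    "overall attitude of investors towards financial market is extremely negative",
))  # 9 shared / 11 total -> ~0.818
```

So the pattern in my results (longer near-identical sentences scoring higher) is at least consistent with word overlap dominating the score.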
I wonder if anyone can shed some light on this? Are there any known limitations of ada-002, or am I perhaps using it wrongly?
Many thanks