Question on text-embedding-ada-002

I am using the boilerplate code below to get embeddings under different models:

import openai

def get_embedding(text, model="text-embedding-ada-002", api_key: str = mykey):  # mykey = your API key, defined elsewhere
    openai.api_key = api_key
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

and then I just calculate some simple “tests” such as the ones below:

from scipy.spatial.distance import cosine  # cosine *distance*; assumed to be the cosine() used below

model = 'text-embedding-ada-002'
comp_x = ['bull', 'bullish', 'love', 'i love apple', 'rise', 'positive', 'overall attitude of investors towards financial market is extremely positive']
comp_y = ['bear', 'bearish', 'hate', 'i hate apple', 'fall', 'negative', 'overall attitude of investors towards financial market is extremely negative']
for (x, y) in zip(comp_x, comp_y):
    l_x = get_embedding(x, model=model)
    l_y = get_embedding(y, model=model)
    print(f'{x} vs {y}  \nSimilarity: {1 - cosine(l_x, l_y)}')

Here is what I see as results:

bull vs bear  
Similarity: 0.8770496452605026
bullish vs bearish  
Similarity: 0.9221341441559032
love vs hate  
Similarity: 0.8440677043256933
i love apple vs i hate apple  
Similarity: 0.912365899889429
rise vs fall  
Similarity: 0.859895846856977
positive vs negative  
Similarity: 0.9312248550781324
overall attitude of investors towards financial market is extremely positive vs overall attitude of investors towards financial market is extremely negative  
Similarity: 0.9420619054524753

I have to say, I am surprised by the results of these simple tests. From reading the (limited) docs, my impression was that ada-002 does “contextual” ML (a transformer-based neural network, etc.) with a lot of quite advanced bells and whistles.

However, the results above seem to indicate that ada-002 is doing more of a “syntactic” match. The fact that “love” vs “hate” has a score of 0.844, yet “i love apple” vs “i hate apple” has a score of 0.912, strikes me as a telling sign.

In the last example, the sentence is actually “long” but the ONLY meaningful distinction is the word “positive” vs “negative”, and the similarity is at a whopping 0.942!

I wonder if anyone can shed some light on this? Are there any known limitations of ada-002? Or maybe I am using ada-002 wrongly?

many thanks

3 Likes

Yeah, I see where you’re going with this. But is an embedding approach ideal for this good-vs-bad objective? The embeddings are simply vectors into a space of content. Something like the prompt below might give you better chances (a minimal sketch of sending such a prompt follows the example).

Given a text, give it a controversy score from 0 to 10.

Examples:

1 + 1 = 2
Controversy score: 0

Starting April 15th, only verified accounts on Twitter will be eligible to be in For You recommendations
Controversy score: 5

Everyone has the right to own and use guns
Controversy score: 9

Immigration should be completely banned to protect our country
Controversy score: 10

The response should follow the format:

Controversy score: { score }
Reason: { reason }

Here is the text.
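
A minimal sketch (my own illustration, assuming the pre-1.0 openai Python library used in the code above) of sending this kind of few-shot scoring prompt to gpt-3.5-turbo:

import openai

# Hypothetical helper: wraps the few-shot "controversy score" prompt above
# and asks the chat model to score a new text.
def controversy_score(text: str) -> str:
    prompt = (
        "Given a text, give it a controversy score from 0 to 10.\n\n"
        "Examples:\n\n"
        "1 + 1 = 2\nControversy score: 0\n\n"
        "Everyone has the right to own and use guns\nControversy score: 9\n\n"
        "The response should follow the format:\n\n"
        "Controversy score: { score }\nReason: { reason }\n\n"
        f"Here is the text.\n{text}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]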
1 Like

As far as I know (I am not experienced enough in this matter; I only studied it recently), it is just a coincidence:
Similarity = cosine(l_x, l_y)       # this is called cosine similarity
Distance = 1 - cosine(l_x, l_y)     # this is called cosine distance, expressing the difference between words

So “love” vs “hate” has a distance (or difference) of 0.844, which means it has a similarity score of only 0.156.

So, I believe there is some significant “distance” between “hate” and “love”.

I am not sure if this could be of any help.

1 Like

You calculated the distance between the words.
You can now take the results and format them into prompts or a single batched prompt.

You can build a completion request, for example:
Given the following result of a cosine distance between two words, explain the difference. The response should be {words} {distance} {explanation}. The request is: words=good vs bad, distance=0.92

You can try this request with GPT-3.5-turbo or GPT-4.

I would recommend giving the model several examples, or defining the cosine similarity outputs first, and then making the requests.
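
For instance, a rough sketch of formatting several results into one batched prompt (the pair values are copied from earlier in the thread; the variable names are only illustrative):

# Illustrative (word, word, score) triples taken from earlier in the thread.
pairs = [("bull", "bear", 0.877), ("love", "hate", 0.844), ("good", "bad", 0.92)]

prompt = (
    "Given the following result of a cosine distance between two words, "
    "explain the difference. The response should be {words} {distance} {explanation}.\n\n"
)
prompt += "\n".join(f"words={a} vs {b}, distance={d}" for a, b, d in pairs)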

1 Like

Thanks for the reply, AlexDeM. Please note that I used “cosine” (the distance), NOT “cosine_similarity”.

That is precisely why, in my code example, I used 1 - cosine as the similarity for display.

So the result of 0.844 between love and hate indeed means: the “similarity” between the two is 0.844.

@jz97, @denis.rothman76 I owe you apologies.

I’ve been caught in a trap: one of the “stupidest function definitions” ever made in a library (probably, in the history of software development). If you like I can delete my comment.

Although an entire civilization has known, ever since the first Islamic expansion brought the beginnings of trigonometry, that:
the cosine of a zero angle (between two identical vectors, for example) is 1,
some OpenAI developers decided to go against the grain of history and the most basic definitions of mathematics.

@jz97 is right!
The openai.embeddings_utils.py library defines:
cosine_similarity to calculate similarity, with range [-1 to 1];
cosine (where is the _distance???) to calculate distance, with range [2 to 0];

They named everything with an underscore (_) except cosine(_distance). Go figure. Even the models (or some of them) don’t know about such a definition; they advise the use of cosine_distance as the correct name.

From the OpenAI documentation:
# get distances between the source embedding and other embeddings (function from embeddings_utils.py)
distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")

So, the following table:
Similarity       | cosine_similarity  | cosine (_distance)
Function         | A·B / (|A| × |B|)  | 1 - cosine_similarity(A, B)
Range            | [-1 to 1]          | [2 to 0]
fully dissimilar | -1                 | 2  [= 1 - (-1)]
orthogonal       | 0                  | 1  [= 1 - 0]
identical        | 1                  | 0  [= 1 - 1]
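
A quick check of the naming trap, assuming the cosine() in question behaves like scipy.spatial.distance.cosine (i.e. it returns the distance):

import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 2.0, 3.0])
a = a / np.linalg.norm(a)        # unit vector, like an ada-002 embedding

print(cosine(a, a))              # 0.0 -> cosine *distance* between identical vectors
print(1 - cosine(a, a))          # 1.0 -> cosine similarity between identical vectors
print(float(np.dot(a, a)))       # 1.0 -> dot product equals similarity for unit vectors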

1 Like

@AlexDeM

You don’t owe an apology to anybody!!!

The only way to move forward in AI is to take risks and progress through trial and error.

We only learn through mistakes, introspection, and support from others. Exactly like OpenAI with us beta testers.

Continue at full speed and don’t let anything or anyone slow you down! :sunglasses:

1 Like

Yeah, I second that… we are all equally guilty when it comes to exploring and learning all of this.

Back to the topic: I do get superior performance using GPT-3.5 for completion.

However, text-embedding-ada-002 just makes so much sense in terms of cost… and one has to think that, with all 1500+ dimensions of the vector, there has to be a more “sensible” way to extract at least something marginally similar to what a full GPT “completion” model does?

Maybe the simple cosine distance is just not the right way?
Maybe there is some other technique to derive the “semantic similarity” from the embedding vector?

I refuse to believe that the embedding is “bad”, in other words not as OpenAI advertised. Surely I am not alone in trying to derive the “semantic similarity” between two (ada-002) embedding vectors?

2 Likes

In Chapter 15 of Transformers for NLP, 2nd Edition, I share how I see embedding models in production:

1. Collect data and organize it without AI, in a classical way.
2. Run one of OpenAI’s powerful embedding models. Save the embeddings to limit the cost.
3. In Phase 1 of a project, don’t do anything fancy: apply a scikit-learn ML algorithm (see the sketch below). By the way, this is in the OpenAI documentation.
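
A minimal sketch of step 3, assuming the embeddings were already computed (for example with a function like get_embedding above) and saved to disk; the file names and labels are hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical cached data: one 1536-dimensional ada-002 embedding per text, plus a label.
X = np.load("embeddings.npy")    # shape (n_samples, 1536)
y = np.load("labels.npy")        # e.g. 0 = bearish, 1 = bullish

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))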

Chat completions are another story.
In the chapter17 directory of the GitHub repository of the book, there is an example of advanced prompt engineering.
The output of the ML from item 3 can be used as a knowledge base, along with the labels of the data, to generate cool responses.

We just need to think in terms of pipelines that contain classical programming, DL with OpenAI and ML.

With that upskill, DS/AI engineers are all set for the 2020s! :blush::sunglasses:

Yes, ada-002 has poor geometry and is non-isotropic.

I mentioned it initially a while ago over here:

And I came up with a solution that involves PCA (Principal Component Analysis) fitting of the data over here:

Realize that ada-002 only requires dot products as a distance metric (equivalent to cosine similarity, since ada-002 embeddings are unit vectors). If you had a TON of embeddings and needed a speedup in your search, you could drop down to a Manhattan metric instead, since this only involves additions (subtractions) and absolute values.

But with a 400k set of embeddings, an exhaustive search only takes 1 second using dot products (inner products) and basic cloud functions.
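
A small sketch of that exhaustive dot-product search (my own illustration, assuming the embeddings are unit-normalized and stacked into a numpy array):

import numpy as np

def top_k(query, corpus, k=5):
    # For unit vectors the dot product equals the cosine similarity,
    # so one matrix-vector product scores the whole corpus.
    scores = corpus @ query          # shape (n_embeddings,)
    return np.argsort(-scores)[:k]   # indices of the k most similar embeddings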

3 Likes

Thank you, Curt. Very interesting thought on using PCA to “cluster” the embedding vectors.
Will try.
I did try k-means clustering, but it is not looking promising at all… I think that’s unsurprising, because k-means uses the “cosine distance” as the metric to isolate clusters anyway?

Further on this thread: the more I play with the ada-002 embeddings, the more I question whether it really is “semantic search”, as advertised on OpenAI’s own website.

Look at the simple results below:

good vs not good  
Similarity: 0.879691949881845
extremely good vs extremely not good  
Similarity: 0.9220934366536766
this code is extremely good vs this code is extremely not good  
Similarity: 0.933950852643895

Notice that the similarity increases as I put more “common” words in place… Just from this simple example, this seems to naively suggest that ada-002 is doing a simple “syntax” match rather than a “semantics” match. To put it another way, quite bluntly: I do not see evidence of “ML”/“DL” purely from this.

2 Likes

I have the same results as you.
And I realize “ML/DL” are prices, not results, haha