Looking at the data above, I expected the cosine similarity of “love” and “like” to be close to 1, “love” and “hate” to be close to -1, and “love” and “find” to be close to 0. But the API is not behaving as I expected.
In fact, there are no negative numbers at all.
How should I interpret these numbers?
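For reference, what I expected is the textbook behavior of cosine similarity. A toy numpy sketch (purely illustrative, not my actual code):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a||b|), ranging from -1 to 1
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 1.0, same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0, opposite
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0, orthogonal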
Where is your cosine_similarity function coming from?
Did you try normalizing the similarity matrix? You can do this with scikit-learn’s MinMaxScaler and Pandas:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

scaler = MinMaxScaler(feature_range=(-1, 1))
normalized_matrix = scaler.fit_transform(df)  # rescales each column to [-1, 1]
normalized_df = pd.DataFrame(normalized_matrix, columns=df.columns, index=df.index)
I didn’t try it out myself, but I hope this can give you some hints to start with.
From your matrix I see that find ↔ love is farther apart than love ↔ hate, which makes sense, doesn’t it?
Someone will correct me if I’m wrong, but maybe you need to think about this differently?
You have a 2000-something-dimensional space, and you’re surprised that “love” and “hate” are relatively close to each other.
They’re both words, they’re both expressing emotions, they’re both 4 letters long (probably not relevant), they’re both English, etc., etc.
In this gigantic vector space, they have much more in common than not, so it would be very surprising indeed if they were diametrically opposed.
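You can see this effect with a toy numpy sketch (purely illustrative, not the real embedding geometry): give two random high-dimensional vectors a large shared component, and their cosine similarity stays high even though their specific parts are completely unrelated.

import numpy as np

rng = np.random.default_rng(0)
dim = 2000

# the part "love" and "hate" have in common: both words, both emotions, both English...
common = rng.normal(size=dim)

# small word-specific parts pointing in unrelated directions
love = common + 0.3 * rng.normal(size=dim)
hate = common + 0.3 * rng.normal(size=dim)

print(np.dot(love, hate) / (np.linalg.norm(love) * np.linalg.norm(hate)))  # ~0.9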
Hope this helps
Thank you for leaving a comment.
I know cosine similarity can go down to -1. Even so, it was difficult to find word pairs with a similarity below 0.7. (This is the part I haven’t figured out yet.)
Even “인형”, the Korean word for “doll” in my native language, showed similarities above 0.7.
If min-max scaling is used, the values will reach -1, but since they are no longer values derived from cosine similarity, much of the meaning seems to be lost.
Because of my poor English, I rely heavily on translators, so please forgive me if there are awkward sentences.
Thank you for commenting.
I think you are right:
“hate” and “love” are very related words.
However, looking at cosine similarity values over the [-1, 1] range, it is hard to find values below 0.7 in practice.
If this is the best result we can get, it seems difficult to expect something like the famous “King - Queen + Girl = Boy” example that we have seen so often.
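For reference, this is how I understand that test would look. A sketch, where get_embedding is a hypothetical stand-in for however the word vectors are fetched:

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# get_embedding is a placeholder, not a real API call
king, queen, girl, boy = (get_embedding(w) for w in ["king", "queen", "girl", "boy"])

# King - Queen + Girl should land closest to Boy if the analogy holds
candidate = king - queen + girl
print(cos(candidate, boy))  # ideally the highest score among all candidate words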
Because of my poor English, I rely heavily on translators, so please forgive me if there are awkward sentences.
I haven’t seen this done, but it does sound interesting. My understanding is that you need to do the arithmetic before you take the cosine similarity, to remove all the common dimensions. Is that the point of confusion?
One thing you could try is to find all the common dimension indices in the vectors for love/hate and delete them, and then take the cosine similarity again; you should get the -1 you’re looking for. (You could set them to 0 instead, but I don’t know how your cosine similarity algorithm responds to zeroes.)
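Something like this rough sketch (the threshold for calling a dimension “common” is an arbitrary choice of mine):

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cos_without_common(a, b, rel_tol=0.1):
    # keep only the dimensions where the two vectors clearly differ,
    # i.e. delete the "common" dimensions before recomputing
    differ = np.abs(a - b) > rel_tol * np.maximum(np.abs(a), np.abs(b))
    return cos(a[differ], b[differ])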
I hope I’m understanding you correctly
Thank you again.
I think you understood me correctly.
Currently I have embeddings, and I am trying to use triplet loss to fit them to my data. (However, it is not working well; it does not seem to have much effect on data other than the classification training data.)
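For context, this is the kind of setup I mean; a minimal PyTorch sketch with random placeholder embeddings, not my actual training code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# random placeholders for anchor / positive / negative embedding batches
anchor   = torch.randn(32, 2048, requires_grad=True)
positive = torch.randn(32, 2048)
negative = torch.randn(32, 2048)

# triplet loss with cosine distance: pull anchor toward positive,
# push it away from negative by at least `margin`
loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda x, y: 1.0 - F.cosine_similarity(x, y),
    margin=0.5,
)
loss = loss_fn(anchor, positive, negative)
loss.backward()
print(loss.item())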