Looking at the data above, I expected the cosine similarity of “love” and “like” to be close to 1, “love” and “hate” to be close to -1, and “love” and “find” to be close to 0. But the API is not behaving as I expected.
In fact, there are no negative numbers at all.
How should I interpret these numbers?
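For reference, what I expected is the textbook behavior of cosine similarity. A toy numpy sketch (purely illustrative, not my actual code):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a||b|), ranging from -1 to 1
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 1.0, same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0, opposite
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0, orthogonal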
Where is your cosine_similarity function coming from?
Did you try normalizing the similarity matrix? You can do this with scikit-learn’s MinMaxScaler and Pandas:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

scaler = MinMaxScaler(feature_range=(-1, 1))
normalized_matrix = scaler.fit_transform(df)  # rescales each column to [-1, 1]
normalized_df = pd.DataFrame(normalized_matrix, columns=df.columns, index=df.index)
I didn’t try it out myself, but I hope this can give you some hints to start with.
From your matrix I see that find ↔ love is farther apart than love ↔ hate, which makes sense, doesn’t it?
Someone will correct me if I’m wrong, but maybe you need to think about this differently?
You have a 2000-something-dimensional space, and you’re surprised that “love” and “hate” are relatively close to each other.
They’re both words, they’re both expressing emotions, they’re both 4 letters long (probably not relevant), they’re both English, etc., etc.
In this gigantic vector space, they have much more in common than not, so it would be very surprising indeed if they were diametrically opposed.
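You can see this effect with a toy numpy sketch (purely illustrative, not the real embedding geometry): give two random high-dimensional vectors a large shared component, and their cosine similarity stays high even though their specific parts are completely unrelated.

import numpy as np

rng = np.random.default_rng(0)
dim = 2000

# the part "love" and "hate" have in common: both words, both emotions, both English...
common = rng.normal(size=dim)

# small word-specific parts pointing in unrelated directions
love = common + 0.3 * rng.normal(size=dim)
hate = common + 0.3 * rng.normal(size=dim)

print(np.dot(love, hate) / (np.linalg.norm(love) * np.linalg.norm(hate)))  # ~0.9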
Hope this helps
Thank you for leaving a comment.
I know cosine similarity can go down to -1. Even so, it was difficult to find word pairs with a similarity below 0.7. (This is the part I haven’t figured out yet.)
Even “인형”, the Korean word for “doll” in my native language, showed similarities above 0.7.
If min-max scaling is used, the values will reach -1, but since they are no longer values derived from cosine similarity, much of the meaning seems to be lost.
Because of my poor English, I rely heavily on translators, so please forgive me if there are awkward sentences.
Thank you for commenting.
I think you are right:
“hate” and “love” are very related words.
However, looking at cosine similarity values over the [-1, 1] range, it is hard to find values below 0.7 in practice.
If this is the best result we can get, it seems difficult to expect something like the famous “King - Queen + Girl = Boy” example that we have seen so often.
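For reference, this is how I understand that test would look. A sketch, where get_embedding is a hypothetical stand-in for however the word vectors are fetched:

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# get_embedding is a placeholder, not a real API call
king, queen, girl, boy = (get_embedding(w) for w in ["king", "queen", "girl", "boy"])

# King - Queen + Girl should land closest to Boy if the analogy holds
candidate = king - queen + girl
print(cos(candidate, boy))  # ideally the highest score among all candidate words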
Because of my poor English, I rely heavily on translators, so please forgive me if there are awkward sentences.
I haven’t seen this done, but it does sound interesting. My understanding is that you need to do the arithmetic before you take the cosine similarity, to remove all the common dimensions. Is that the point of confusion?
One thing you could try is to find all the common dimension indices in the vectors for love/hate and delete them, and then take the cosine similarity again; you should get the -1 you’re looking for. (You could set them to 0 instead, but I don’t know how your cosine similarity algorithm responds to zeroes.)
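Something like this rough sketch (the threshold for calling a dimension “common” is an arbitrary choice of mine):

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cos_without_common(a, b, rel_tol=0.1):
    # keep only the dimensions where the two vectors clearly differ,
    # i.e. delete the "common" dimensions before recomputing
    differ = np.abs(a - b) > rel_tol * np.maximum(np.abs(a), np.abs(b))
    return cos(a[differ], b[differ])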
I hope I’m understanding you correctly
Thank you again.
I think you understood me correctly.
Currently I have embeddings, and I am trying to use triplet loss to fit them to my data. (However, it is not working well; it does not seem to have much effect on data other than the classification training data.)
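For context, this is the kind of setup I mean; a minimal PyTorch sketch with random placeholder embeddings, not my actual training code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# random placeholders for anchor / positive / negative embedding batches
anchor   = torch.randn(32, 2048, requires_grad=True)
positive = torch.randn(32, 2048)
negative = torch.randn(32, 2048)

# triplet loss with cosine distance: pull anchor toward positive,
# push it away from negative by at least `margin`
loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda x, y: 1.0 - F.cosine_similarity(x, y),
    margin=0.5,
)
loss = loss_fn(anchor, positive, negative)
loss.backward()
print(loss.item())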