Some findings on the meaning of embeddings, discussed with the example “woman - man = queen - king = female”

Recently, I’ve been using the embeddings endpoint with the text-embedding-ada-002 model and have made some interesting discoveries that I would like to discuss with everyone.
One commonly cited example is “woman - man = queen - king = female,” as shown in the following image.
The points lie on a circle because all the embedding vectors are normalized to length 1; in 2D, unit-length vectors all fall on the unit circle.
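As a quick sanity check (a minimal sketch, assuming the openai Python client v1.x and an OPENAI_API_KEY set in the environment), you can confirm the unit norm directly:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    """Return the text-embedding-ada-002 vector for `text`."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

v = embed("man")
print(v.shape, np.linalg.norm(v))  # (1536,), norm ~ 1.0
```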
After doing some experiments, below are the results. An arrow such as Man → Woman denotes the difference vector Emb(Woman) - Emb(Man).
The similarity between these difference vectors is not as high as I expected.
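For reference, here is roughly how the arrow vectors can be computed and compared, reusing the embed() helper from the sketch above (the exact similarity you get will depend on the model version):

```python
# Difference ("arrow") vectors for the two analogies.
d_gender_human = embed("woman") - embed("man")
d_gender_royal = embed("queen") - embed("king")

# Cosine similarity between the two difference vectors.
cos_sim = np.dot(d_gender_human, d_gender_royal) / (
    np.linalg.norm(d_gender_human) * np.linalg.norm(d_gender_royal)
)
print(f"(woman - man) vs (queen - king): {cos_sim:.3f}")
```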
Then I came up with an idea: is 1,536 dimensions too high, and is the amount of information contained in a single word too small?
So I ran another experiment, using descriptions generated by ChatGPT instead of single words:

  • description_man = ‘A male human being, typically distinguished by physical characteristics such as a deeper voice, facial hair, and greater height and muscle mass than females. Men have played significant roles in history and society, and have been involved in various fields such as science, politics, arts, and sports.’
  • description_woman = ‘A female human being, typically distinguished by physical characteristics such as breasts, wider hips, and a higher-pitched voice than males. Women have also played significant roles in history and society, although they have faced challenges such as gender discrimination and inequality. Women have made contributions to various fields such as science, politics, arts, and sports.’
  • description_king = ‘A male monarch who typically inherits his position by birthright and rules over a kingdom or an empire. Kings have played significant roles in history, and have been regarded as powerful figures with the ability to make important decisions that affect the lives of their subjects.’
  • description_queen = ‘A female monarch who typically inherits her position by birthright, or who marries a king. Queens have also played significant roles in history, and have been regarded as powerful figures with the ability to influence the decisions of their monarchs. Queens have been involved in various fields such as politics, arts, and philanthropy.’

The similarity increased a lot!!!
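The comparison itself is the same as before, just with the descriptions substituted for the single words (again a sketch reusing embed() from above; description_man etc. are the strings listed above):

```python
d_descr_human = embed(description_woman) - embed(description_man)
d_descr_royal = embed(description_queen) - embed(description_king)

cos_sim = np.dot(d_descr_human, d_descr_royal) / (
    np.linalg.norm(d_descr_human) * np.linalg.norm(d_descr_royal)
)
print(f"description-based similarity: {cos_sim:.3f}")
```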
My inference is that the Ada model may have learned too much knowledge (compared to the earlier Word2Vec models), so the word “Man” may express not only male gender but also a nickname, a place, a song, etc. Therefore, the difference between “Man” and “Woman” is not just gender.
I’m curious to know what everyone thinks about this research.


Very interesting research.
Why use subtraction as a distance, instead of the more common cosine distance
D_cos = 1 - S_cos, where S_cos is the cosine similarity?
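In code that would be something like (a minimal NumPy sketch):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """D_cos = 1 - S_cos, where S_cos is the cosine similarity."""
    s_cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - s_cos
```

Note that since ada-002 embeddings are already unit-normalized, the dot product alone equals S_cos.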

How would you perform the description calculations?
Option 1. (e.g.) Emb(description_woman) - Emb(description_man);
Option 2. Counting all equal or similar words in both descriptions;
Option 3. Arithmetic mean: AM(Emb(Word(description_woman; 1 to end_n))) - AM(Emb(Word(description_man; 1 to end_n))) (see the sketch after this list);
Option 4. Weighted mean - same as Option 3 but with weights;
Option 5. Any other method.
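For Option 3, what I have in mind is something like this sketch (the embed() helper and the description_* strings are from the post above; the whitespace split is a simplifying assumption - a real tokenizer would behave differently):

```python
import numpy as np

def mean_word_embedding(text: str) -> np.ndarray:
    """Option 3: arithmetic mean of per-word embeddings (naive whitespace split)."""
    vecs = np.array([embed(w) for w in text.split()])  # one API call per word
    return vecs.mean(axis=0)

diff = mean_word_embedding(description_woman) - mean_word_embedding(description_man)
```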

Please advise - I am curious because, for quite some time, I’ve been looking for good methods for word matching. And yes, 1,536 dimensions for a single word is a lot compared to the commonly recommended range of 100-1,000 for word embeddings. It seems OpenAI is trying to achieve 95% certainty in a normal distribution.

The dimensions and their individual meanings in the text-embedding-ada-002 model are not explicitly documented by OpenAI, as they are considered part of the model architecture and implementation details; there is no paper or documentation. I wonder if OpenAI used embeddings in its token engineering.