Reading through that paper made me think these embeddings might be encoding how common a word is, as well as its semantic meaning. I'm only interested in the conceptual meaning for what I'm doing, so I wanted to verify that, and, if it's true, try to subtract it out.
From my initial tests, it seems that is true. I made 100 sentences by giving ChatGPT a list of the 50 most common words to use, and another set by telling it to avoid those words (as well as pronouns etc.). Examples are:
He is a good man.
The flowers in the garden are beautiful.
vs
Ravaged city bears scars of war.
Velociraptors roamed prehistoric savannah.
I made them all isomorphic, and made an image from the sums of their embeddings (more red is more positive, more blue is more negative). At least in this test it is clear that the common words (first image) have generally lower values, and the uncommon words higher values. These images are just normalized 48x32 images made directly from the 1536 embedding values.
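A minimal sketch of how such an image can be produced, assuming the embeddings are already available as 1536-dimensional numpy vectors (the function name, the 32x48 grid orientation, and the simple max-abs normalization are my assumptions, not necessarily the exact method used above):

```python
import numpy as np

def embedding_sum_image(embeddings, shape=(32, 48)):
    """Sum a set of 1536-d embedding vectors and map the result to an
    RGB image: positive values shade red, negative values shade blue."""
    total = np.sum(embeddings, axis=0)
    # Normalize to [-1, 1] by the largest absolute value.
    total = total / np.max(np.abs(total))
    grid = total.reshape(shape)
    img = np.zeros(shape + (3,), dtype=np.uint8)
    img[..., 0] = np.where(grid > 0, grid * 255, 0).astype(np.uint8)   # red channel
    img[..., 2] = np.where(grid < 0, -grid * 255, 0).astype(np.uint8)  # blue channel
    return img

# Demo with random stand-in vectors; real embeddings would come from
# an embedding API call for each of the 100 sentences.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 1536))
img = embedding_sum_image(fake_embeddings)
print(img.shape)  # (32, 48, 3)
```

Comparing the per-pixel intensity between the two groups' images is then just a comparison of the normalized summed embedding values.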
This is a first pass, but I think there is a signal there. It makes sense that word frequency is embedded, but the fact that common words tend toward low values is a bit surprising.