Reading through that paper made me think these embeddings might be encoding how common a word is, as well as its semantic meaning. I'm only interested in the conceptual meaning for what I'm doing, so I wanted to verify that, and, if it's true, try to subtract it out.
From my initial tests, it seems that is true. I made 100 sentences by giving ChatGPT a list of the 50 most common words to use, and another set by telling it to avoid those words (as well as pronouns etc.). Examples are:
He is a good man.
The flowers in the garden are beautiful.
vs
Ravaged city bears scars of war.
Velociraptors roamed prehistoric savannah.
I made them all isomorphic, and made an image from the sums of their embeddings (more red is more positive, more blue is more negative). At least in this test it is clear that the common words (first image) have generally lower values, and the uncommon words higher values. These images are just normalized 48x32 images made directly from the 1536 embedding values.
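A minimal sketch of how such an image can be produced, assuming the embeddings are already available as 1536-dimensional numpy vectors (the function name, the 32x48 grid orientation, and the simple max-abs normalization are my assumptions, not necessarily the exact method used above):

```python
import numpy as np

def embedding_sum_image(embeddings, shape=(32, 48)):
    """Sum a set of 1536-d embedding vectors and map the result to an
    RGB image: positive values shade red, negative values shade blue."""
    total = np.sum(embeddings, axis=0)
    # Normalize to [-1, 1] by the largest absolute value.
    total = total / np.max(np.abs(total))
    grid = total.reshape(shape)
    img = np.zeros(shape + (3,), dtype=np.uint8)
    img[..., 0] = np.where(grid > 0, grid * 255, 0).astype(np.uint8)   # red channel
    img[..., 2] = np.where(grid < 0, -grid * 255, 0).astype(np.uint8)  # blue channel
    return img

# Demo with random stand-in vectors; real embeddings would come from
# an embedding API call for each of the 100 sentences.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 1536))
img = embedding_sum_image(fake_embeddings)
print(img.shape)  # (32, 48, 3)
```

Comparing the per-pixel intensity between the two groups' images is then just a comparison of the normalized summed embedding values.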
This is a first pass, but I think there is a signal there. It makes sense that word frequency is embedded, but the fact that common words tend toward low values is a bit surprising.