In my experience, the engine doesn’t seem to have a wide angular distribution, and I have no idea why either. I posted another thread on this last week; I don’t think anyone knows why, so I was thinking of doing a deeper dive.
As background, I embedded 80k random texts and phrases. If I pick one at random and run a vector search for the top 10 most opposite texts, I get cosine similarities similar to the one you had, around 45 degrees apart (a cosine of roughly 0.7). It really had me wondering whether the model dedicates much of its vector space to other things.
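In case anyone wants to try the same check, here is a minimal sketch of what I mean (not my exact script; the random vectors below are just stand-ins, so only real embeddings will show the ~0.7 floor):

```python
import numpy as np

# Stand-in data: with real embeddings you'd load an (N, d) array instead,
# e.g. 80k texts embedded with your embedding model of choice.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(80_000, 1536)).astype(np.float32)

# Normalize so plain dot products are cosine similarities.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pick one text at random and score it against all the others.
anchor = rng.integers(len(unit))
sims = unit @ unit[anchor]
sims[anchor] = np.nan  # exclude the self-match (NaN sorts to the end)

# "Most opposite" = the 10 lowest cosine similarities.
lowest = np.argsort(sims)[:10]
print(sims[lowest])                          # with real embeddings these sit near ~0.7
print(np.degrees(np.arccos(sims[lowest])))   # i.e. only about 45 degrees apart
```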
Now, one thing in my case, and maybe it’s the same for you, is that all the texts are relatively short. If the model spreads embeddings across more of the vector space as texts get longer, that might explain it.
But intuitively, you’d think it embeds based on semantic similarity, not length. That said, when I look at the closest texts, it returns things that are topically related even when the sentiment is the opposite. That isn’t a big deal for me, but I found it interesting too.