I think what might be confusing about OpenAI embeddings is that the embedding vector for a phrase like “Anything you would like to share?” comes from an OpenAI model trained on text from the global internet. The same is true for the embedding vector for “I need to solve the problem with money”: the vector is produced by the OpenAI artificial neural network (ANN) together with a particular trained model.
The embeddings (vectors) are not based on a direct analysis of the input text alone, but on what the model learned from the huge dataset used to train the ANN. That is, at least, my current understanding.
So, using some Ruby code I cobbled together (using my own cosine similarity function, not from a library), let’s look at this:
irb(main):013:0> Embeddings.test_strings("I need to solve the problem with money","Anything you would like to share?")
=> 0.7614775318811315
irb(main):014:0> Embeddings.test_strings("I need to solve the problem with money","What is your financial situation?")
=> 0.8475256263838489
irb(main):015:0> Embeddings.test_strings("I need to solve the problem with money","Fraud")
=> 0.7632965853455049
irb(main):016:0> Embeddings.test_strings("I need to solve the problem with money","CitiBank")
=> 0.7823379047316411
If we rank these, the most similar are, in descending order:
- “What is your financial situation?”
- “CitiBank”
- “Fraud”
- “Anything you would like to share?”
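The ranking step itself is just a descending sort on the similarity scores; a sketch using the values from the irb session above (the hash here is hypothetical, built by hand from those results):

```ruby
# Scores copied from the irb session, keyed by candidate phrase.
similarities = {
  "Anything you would like to share?" => 0.7614775318811315,
  "What is your financial situation?" => 0.8475256263838489,
  "Fraud"                             => 0.7632965853455049,
  "CitiBank"                          => 0.7823379047316411,
}

# Sort by score, highest first, keeping only the phrases.
ranked = similarities.sort_by { |_phrase, score| -score }.map(&:first)
```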
This ranking makes perfect sense to me as a measure of similarity to “I need to solve the problem with money”.
So, based on what we might expect to see on the global internet, the above cosine similarities of embedding vectors produced by the text-embedding-ada-002 model seem reasonable to me.