Semantic search through embeddings

Hi Guys,
I have the following code.
embedding_a is a question from an FAQ, embedding_b is the answer to that question, and _c, _d, and _e are different questions from that FAQ.

My hope was that _a and _b would have the greatest similarity. However, the output is this:

0.8239813593399501 0.7841262926596749 0.8586356206237657 0.8292298928118564

meaning the right question/answer pair scores poorly (3rd place out of 4).
Is that just a data problem that can't be helped, or am I missing something?

import numpy as np

# resp is the response from the OpenAI Embeddings API,
# called with all five FAQ texts in a single request
embedding_a = resp['data'][0]['embedding']  # the FAQ question
embedding_b = resp['data'][1]['embedding']  # the answer to that question
embedding_c = resp['data'][2]['embedding']  # other FAQ questions
embedding_d = resp['data'][3]['embedding']
embedding_e = resp['data'][4]['embedding']

# dot product of the question against the answer and the other questions
similarity_score_0 = np.dot(embedding_a, embedding_b)
similarity_score_1 = np.dot(embedding_a, embedding_c)
similarity_score_2 = np.dot(embedding_a, embedding_d)
similarity_score_3 = np.dot(embedding_a, embedding_e)

print(similarity_score_0, similarity_score_1, similarity_score_2, similarity_score_3)
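
As an aside: the ada embedding vectors are returned with (approximately) unit length, so the dot product behaves like cosine similarity, but computing the cosine similarity explicitly rules out normalization as the culprit. A minimal sketch, reusing the embedding_* variables above:

import numpy as np

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|);
    # identical to the plain dot product when both vectors have length 1
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, emb in [('b', embedding_b), ('c', embedding_c),
                  ('d', embedding_d), ('e', embedding_e)]:
    print(name, cosine_similarity(embedding_a, emb))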

I suspect you will have to share the actual questions and answers to see what is going on.


I'm totally with @anil.nair, who commented above.

What a polite reply 🙂

Of course, the only way to answer this question is to see the actual text being processed and to know the exact model used.

The vectors come from the model used and the data it was trained on, not from a linear analysis of the text in the strings.

This Ruby session, where I manually created a cosine-similarity function on top of the OpenAI API, illustrates this (a Python sketch of the same helper follows the output):

irb(main):017:0> Embeddings.test_strings("I like dogs more than cats.","dog")
=> 0.8222681455164221
irb(main):018:0> Embeddings.test_strings("I like dogs more than cats.","cat")
=> 0.8079475470848626
irb(main):019:0> Embeddings.test_strings("I like dogs more than cats.","puppy")
=> 0.7848079985269432
irb(main):020:0> Embeddings.test_strings("I like dogs more than cats.","kitten")
=> 0.8015104340445109
irb(main):021:0> Embeddings.test_strings("I like dogs more than cats.","animal abuse")
=> 0.815095683788281
irb(main):022:0> Embeddings.test_strings("I like dogs more than cats.","pets")
=> 0.8299526417237735
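
For readers following along in Python, a minimal sketch of such a helper might look like the code below. It mirrors the two-string signature of Embeddings.test_strings above; the model name and the single batched API call are assumptions, not the exact code behind the Ruby session:

import numpy as np
import openai  # pre-1.0 openai-python, matching the dict-style access earlier in the thread

def test_strings(text_1, text_2, model="text-embedding-ada-002"):
    # model name is an assumption; the exact model used above isn't stated
    resp = openai.Embedding.create(model=model, input=[text_1, text_2])
    a = np.array(resp['data'][0]['embedding'])
    b = np.array(resp['data'][1]['embedding'])
    # manual cosine similarity: dot(a, b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(test_strings("I like dogs more than cats.", "dog"))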

So based on these vectors, we can say that, "according to the global internet," these test strings are similar, not through any literal text analysis, but through how the model "perceives" the world via the eyes of its training data.

That, at least in my mind, is one reason why OpenAI has made these vectors available via their API.

Hope this helps.


Thanks. I'll look into that more deeply.