Semantic search through embeddings

Hi Guys,
I have the following code.
embedding_a is a question from an FAQ, embedding_b is the answer to that question, and _c, _d, and _e are different questions from that FAQ.

My hope was that _a and _b would have the greatest similarity. However, the output is this:

0.8239813593399501 0.7841262926596749 0.8586356206237657 0.8292298928118564

meaning the right question/answer pair scores poorly (3rd place out of 4).
Is that just a data problem that can't be helped, or am I missing something?

import numpy as np

# resp is the response from the OpenAI Embeddings API,
# called with all five FAQ texts in a single request
embedding_a = resp['data'][0]['embedding']  # the FAQ question
embedding_b = resp['data'][1]['embedding']  # the answer to that question
embedding_c = resp['data'][2]['embedding']  # other FAQ questions
embedding_d = resp['data'][3]['embedding']
embedding_e = resp['data'][4]['embedding']

# dot product of the question against the answer and the other questions
similarity_score_0 = np.dot(embedding_a, embedding_b)
similarity_score_1 = np.dot(embedding_a, embedding_c)
similarity_score_2 = np.dot(embedding_a, embedding_d)
similarity_score_3 = np.dot(embedding_a, embedding_e)

print(similarity_score_0, similarity_score_1, similarity_score_2, similarity_score_3)
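
As an aside: the ada embedding vectors are returned with (approximately) unit length, so the dot product behaves like cosine similarity, but computing the cosine similarity explicitly rules out normalization as the culprit. A minimal sketch, reusing the embedding_* variables above:

import numpy as np

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|);
    # identical to the plain dot product when both vectors have length 1
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, emb in [('b', embedding_b), ('c', embedding_c),
                  ('d', embedding_d), ('e', embedding_e)]:
    print(name, cosine_similarity(embedding_a, emb))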

I suspect you will have to share the actual questions and answers to see what is going on.


I'm totally with @anil.nair, who commented above.

What a polite reply 🙂

Of course, the only way to answer this question is to see the actual text being processed and to know the exact model used.

The vectors come from the model used and the data it was trained on, not from a linear analysis of the text in the strings.

This Ruby session, where I manually created a cosine-similarity function on top of the OpenAI API, illustrates this (a Python sketch of the same helper follows the output):

irb(main):017:0> Embeddings.test_strings("I like dogs more than cats.","dog")
=> 0.8222681455164221
irb(main):018:0> Embeddings.test_strings("I like dogs more than cats.","cat")
=> 0.8079475470848626
irb(main):019:0> Embeddings.test_strings("I like dogs more than cats.","puppy")
=> 0.7848079985269432
irb(main):020:0> Embeddings.test_strings("I like dogs more than cats.","kitten")
=> 0.8015104340445109
irb(main):021:0> Embeddings.test_strings("I like dogs more than cats.","animal abuse")
=> 0.815095683788281
irb(main):022:0> Embeddings.test_strings("I like dogs more than cats.","pets")
=> 0.8299526417237735
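
For readers following along in Python, a minimal sketch of such a helper might look like the code below. It mirrors the two-string signature of Embeddings.test_strings above; the model name and the single batched API call are assumptions, not the exact code behind the Ruby session:

import numpy as np
import openai  # pre-1.0 openai-python, matching the dict-style access earlier in the thread

def test_strings(text_1, text_2, model="text-embedding-ada-002"):
    # model name is an assumption; the exact model used above isn't stated
    resp = openai.Embedding.create(model=model, input=[text_1, text_2])
    a = np.array(resp['data'][0]['embedding'])
    b = np.array(resp['data'][1]['embedding'])
    # manual cosine similarity: dot(a, b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(test_strings("I like dogs more than cats.", "dog"))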

So based on these vectors, we can say that, "according to the global internet," these test strings are similar, not through any literal text analysis, but through how the model "perceives" the world via the eyes of its training data.

That, at least in my mind, is one reason why OpenAI has made these vectors available via their API.

Hope this helps.


Thanks. I'll look into that more deeply.