Embeddings giving incorrect results

Hello all

I am trying to build semantic search functionality for our internal company documents. I am very new to embeddings and vector representations, so I apologize in advance if my question is stupid.

I am following the Search-Ask approach from the official OpenAI docs example called Question_answering_using_embeddings, but rather than using the 2022 Olympics data, I am using our company’s documents. I am currently on the search step:

  1. Collect all documents.
  2. Using text-embedding-ada-002, I create an embedding for each document and store it in a vector database along with its textual content. I also prepend some header text to each chunk. The overall structure of the documents is the following (all documents describe internal policies):

“Title: Policy 01
section: 1
Content:
The purpose of this policy is …”

Title and section are the header lines I added. I added the title because sometimes the policy numbers are not mentioned directly in the document. The section is for chunking: if a document is chunked into 2 embeddings, there will be a “section: 1” and a “section: 2” with the same title.
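In code, the indexing step (step 2 above) looks roughly like this. This is a simplified sketch: I'm assuming the current OpenAI Python client, and the upsert call at the end is just a placeholder for whatever the vector database actually does.

from openai import OpenAI

client = OpenAI()

def embed_chunk(title, section, content):
    # Prepend the title and section so the policy number is always part of the embedded text
    text = f"Title: {title}\nsection: {section}\nContent:\n{content}"
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return {"text": text, "embedding": response.data[0].embedding}

# Example: one chunk of a policy document
record = embed_chunk("Policy 01", 1, "The purpose of this policy is ...")
# vector_db.upsert(record)  # placeholder for the actual database call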
Now, when a user asks a question, I create an embedding for the question and pass it to my database to find relevant documents using the database’s cosine similarity function. This is where my problem started:

Suppose there is an internal policy called policy 69. This policy talks about work from home arrangements.
When I asked: “What policy describes work from home arrangements?”, it correctly returned the policy 69 document.
However, when I asked: “What is policy69?”, it returned a completely wrong document.
I tried using openai’s cosine similarity function from the python package but got the same result, so I guess the embedding is not the problem here. I guess there is something wrong with my understanding of embeddings.

So my questions are:

  1. Why is my second query giving an incorrect result?
  2. Is there a way to fix it, perhaps on the embeddings side, or do I have to do additional steps? The second type of query is one of the critical features of this system.

Thanks!


Welcome to the community, @lightning_missile!

In simple terms, embeddings are great at capturing the semantic relationship between texts by representing them as vectors in multidimensional space.

One of the quickest ways to measure the relatedness of two embedding vectors is cosine similarity, which measures the cosine of the angle between the two vectors. When the cosine similarity is 1, the vectors are completely aligned, indicating high semantic similarity.

To find the most semantically similar strings, we can calculate cosine similarity for every vector in our database or list and then sort them in decreasing order of cosine similarity.

For example, when using the text-embedding-ada-002 model for our embeddings, the cosine similarity between “internet” and “network” is 0.9174819140544503, while the cosine similarity between “internet” and “sleep” is 0.8282069755064477.

This means that “network” is more semantically related to “internet” than “sleep” is.
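If you want to reproduce this kind of comparison yourself, here is a minimal sketch (assuming the OpenAI Python client and numpy; the helper names are just illustrative):

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text):
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embed("internet"), embed("network")))  # higher
print(cosine_similarity(embed("internet"), embed("sleep")))    # lower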

In your case, it could be that the string “What is policy69?” is not semantically similar enough to the vectors in your knowledge base.


Hi!
Yes, that is a common issue, and one solution is a hybrid approach where you combine dense vector embeddings with a keyword search (sparse vectors, like BM25).

The link below will give you a good starting point. Here is a quick quote:

BM25F is a variant of BM25 that allows multiple text fields per object to be given different weights in the ranking calculation. These weights are important for when fields in a document are more important than others. For example, a title may be given more weight than the abstract, since the title is sometimes more informative and concise.

Link: Hybrid Search Explained | Weaviate - vector database


Hello, thanks for this explanation! Regarding semantic similarities, I just noticed that when I did not phrase it as a question and just wrote “Policy69”, it returned the correct document. I am surprised that the prefix "What is " has such a big impact on the similarity calculation. Can you explain why that is? My tentative solution is to remove all the stop words from the query (the, is, what, etc.) so it calculates semantic similarity for the relevant keywords only. Do you think this is a good approach?

Hello! I am currently using an old version of Elasticsearch with the k-NN plugin as my vector database. There is no hybrid search functionality in this version yet. Is there an easy way to do the hybrid search approach manually in code? Say I have 2 queries, one for keyword search and another for k-NN search, how do I normalize the scores between them?

One method is to normalize the scores of both the keyword search and the k-nn search to a range between 0 and 1.

Then use a weighted average to combine the two scores. The weights can be adjusted based on the importance you give to semantic relevance (k-nn score) versus exact match (keyword score).

For example: Combined Score = w1 * NormalizedKeywordScore + w2 * NormalizedKNNScore where w1 and w2 are the weights.

So you fetch scores from both searches, normalize each score to the [0, 1] range, compute the combined score using the formula, and rank documents by this combined score.

You will have to experiment with the weights and evaluate the performance on sample queries to fine-tune the approach.
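Here is a rough sketch of that combination step (min-max normalization is just one option, and the function names are illustrative):

def min_max_normalize(scores):
    # Map raw scores to [0, 1]; if all scores are equal, return 1.0 for each
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(keyword_hits, knn_hits, w1=0.5, w2=0.5):
    # keyword_hits / knn_hits: lists of (doc_id, raw_score) tuples from each search
    kw = dict(zip([d for d, _ in keyword_hits],
                  min_max_normalize([s for _, s in keyword_hits])))
    kn = dict(zip([d for d, _ in knn_hits],
                  min_max_normalize([s for _, s in knn_hits])))
    combined = {}
    for doc in set(kw) | set(kn):
        # A document missing from one result set contributes 0 from that side
        combined[doc] = w1 * kw.get(doc, 0.0) + w2 * kn.get(doc, 0.0)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)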


@vb thanks for this formula. Is it OK to use the same weights for both searches? What if I just calculate the average without weighting each search method? Sorry, I am new to these calculations.

That’s just a weighted average of 0.5:0.5. It’s perfectly fine to do and probably the right thing to do as a starting point.


Hi @anon22939549, @vb I just noticed something regarding the normalization of scores in both keyword search and semantic search:
For semantic search, OpenSearch already returns scores in the range [0, 1], so I think I don’t need to normalize them (but they never seem to reach 1; the highest is always 0.9-something).
For keyword normalization, I normalize them as follows:

  1. Get the max_score. This is always the score of the first result.
  2. For each document score, compute score/max_score. This gives me the [0, 1] range that I need.
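In code, my normalization looks roughly like this (simplified; the hit fields follow the usual OpenSearch response format):

def normalize_keyword_scores(hits):
    # hits: OpenSearch keyword results, already sorted by descending _score
    max_score = hits[0]["_score"]  # the first hit always carries the max score
    return [(hit["_id"], hit["_score"] / max_score) for hit in hits]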

The problem I noticed is that, using this formula, the first result always has a score of 1: since the first document has the max_score, its normalized score is always max_score/max_score.
So for example, say I have a semantic search query and a keyword query that return 3 results each, and all 6 results are different:

semantic_search_results = [s1, s2, s3]
keyword_search_results = [k1, k2, k3]

In this case, the highest score is immediately k1, because there is nothing to combine; all results are different.
But this raises the issue that if all results are different, the top result is always a result from the keyword search. This feels odd.

I’m having a hard time normalizing the keyword search scores because OpenSearch scores do not have a fixed maximum, as opposed to cosine similarity, which is capped at 1.
What can I do? Is there a better normalization formula? I’m thinking of giving a higher weight to semantic search to balance out the keyword scores. Is this a good idea? What would be a good weight value?
Thanks again!

Why not just use RRF (Reciprocal Rank Fusion)?

It is independent of any normalization between semantic and keyword scores, because it just uses the rank ordering of the two streams.


Hi @curt.kennedy, unfortunately I am using an old version of OpenSearch, so RRF is not yet available.

No problem, I thought you were coding this yourself. :computer: :desktop_computer:

But for others that are able to write their own code against embeddings and keyword search, and want a hybrid answer of both, RRF is very easy to code up!


Hi @curt.kennedy, well I guess that is the next best option. I got the formula from the Elasticsearch documentation, but I still don’t understand why it works. Here is the sample code:

score = 0.0
for q in queries:
    if d in result(q):
        score += 1.0 / ( k + rank( result(q), d ) )
return score

Can you help me with the following questions?

  1. Does rank mean the position of the document in a list of documents sorted by descending score? So the highest-scoring document (result["hits"]["hits"][0]) has rank 1, the second-highest has rank 2, and so on?
  2. I still don’t understand the purpose of k. The docs say about rank_constant: “This value determines how much influence documents in individual result sets per query have over the final ranked result set.” How does adding 60 to each rank before taking the reciprocal accomplish that?
  3. Can you give me an intuitive understanding of why this formula works? I am writing this as a Python function and I plan to create unit tests for it, but I don’t understand the formula at all.

Thanks

This was alluded to earlier, but if you are open to other search alternatives, Weaviate has a text2vec vectorizer module which allows you to execute hybrid searches. The value here is that you can code your query screen to allow adjustments on the fly: text2vec-transformers | Weaviate - vector database

I’ve added it to mine like this:

Yes, I know, everybody says “Don’t give users options! It’ll only confuse them!” And if you’re in that camp, so be it. But I find this to be a much more viable solution than trying to anticipate in my code what this should be for ALL searches in a fairly diverse dataset.

Just something to consider.


Yes, that is the most popular interpretation.

So for example, you have chunks X, Y, Z and Q. You run embeddings and get correlations sorted as [Y, X, Q, Z], and say keywords give a sort order of [Q, X, Z, Y] (highest to lowest).

So if we enumerate the rankings starting with 1 being the best, and 4 being the worst here, we get RRF scores of

X: 1/2 + 1/2 = 1
Y: 1/1 + 1/4 = 1.25
Z: 1/4 + 1/3 = 0.58
Q: 1/3 + 1/1 = 1.33

Based on this, Q ranks highest overall, slightly edging out Y.

The value of K is typically used as a penalty factor.

Suppose I don’t trust keywords as much as I trust embeddings, so I will artificially down-shift the ranks of all keyword results by 1.

So on the keyword side, I will set K = 1, to perform this rank downshift.

So if I recompute what I did before, here is what I get:

X: 1/2 + 1/(2+1) = 0.83
Y: 1/1 + 1/(4+1) = 1.20
Z: 1/4 + 1/(3+1) = 0.50
Q: 1/3 + 1/(1+1) = 0.83

So now, Y is the highest ranked item, whereas Q is now at a lower rank (2nd place), tied with X.

Also, your penalty factor on a stream (this is what K is) need not be an integer. You could slightly penalize keywords by setting it to 0.50, for example.

So K is just the “rank shift operator”. It gives you the freedom to prioritize one stream over another. For example, if you set K = 100 on the keyword stream, your embedding results would always dominate.
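If you end up coding it yourself, a minimal sketch might look like this (it reproduces the numbers above; the per-stream K list lets you apply the rank shift to just one stream):

def rrf(ranked_lists, ks):
    # ranked_lists: one list of doc ids per stream, best first (rank 1, 2, 3, ...)
    # ks: the rank-shift constant K for each stream
    scores = {}
    for docs, k in zip(ranked_lists, ks):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

embedding_order = ["Y", "X", "Q", "Z"]  # best to worst by cosine similarity
keyword_order = ["Q", "X", "Z", "Y"]    # best to worst by keyword score

print(rrf([embedding_order, keyword_order], ks=[0, 0]))  # Q edges out Y
print(rrf([embedding_order, keyword_order], ks=[0, 1]))  # Y now comes out on top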

Does this make more sense?


@curt.kennedy yes, I think I understand it now! So what if I do not want to prioritize either search method? Can I just remove k? I’m just confused as to why the default value of the k factor is 60, as stated in the docs.

If you want everything equal and your ranks are one-based, so they go 1, 2, 3, … instead of 0, 1, 2, …, then set K = 0 for both streams. For zero-based ranks (0, 1, 2, …), set K = 1. Why? It’s just so you don’t divide by 0 :sweat_smile:

Setting K = 60 for everything is a way of blunting the entire ranking, and personally I’m not sure why it’s useful, since the result is still monotonic for a fixed K on all streams. All this does is say: instead of ranks 1, 2, 3, … on items, I instead have ranks 61, 62, 63, …, so what is the point of this? Nothing really.


@curt.kennedy wow, this RRF is a beautiful solution! Thanks for your explanation. I will implement this. One final question: why do we take the reciprocal rank (1/rank)? I understand this might be a very broad question, but can you provide some pointers in the context of combining scores? When I did not take the reciprocal and just summed the ranks, [x, y, z, q] gave me results of [4, 5, 7, 4], which is totally different from the reciprocal ranks. Basically, I now understand the fusion step; I’d like to understand why the reciprocal ranking itself works as expected. My guess is that it maps the scores between 0 and 1, but I can’t wrap my head around why it works. Apologies for the basic questions, I know nothing about statistical calculations like this.
Thanks again!


It’s basically an unnormalized harmonic mean.

But beyond this technicality, just think of it as an easy way to map correlations that have a poor ranking (high rank number) to a small number. So rank 100 maps to 1/100 = 0.01.

But looking a little deeper at the 1/x curve, it keeps emphasis on the high ranking items and pushes the remaining poor ranking items into the noise.

A line wouldn’t work, since you would eventually dip negative, but also, all the medium-ranking values would compare readily with the higher-ranking items in the linear case. So this non-linear 1/x term emphasizes the top winners in each stream category.

At least that’s my intuition as to why it’s useful.


@curt.kennedy So I think I’ll just implement RRF to combine both semantic and keyword search results. Thanks! I learned a lot.
