Data without meaning gets the highest similarity score

I have the following case. We have two English documents containing technical data about engines. These are chunked and embedded using the LangChain framework and OpenAIEmbeddings, using the model text-embedding-ada-002. Due to some processing of the data, there is a chunk that contains only --- (a markdown horizontal rule).

When providing a question in English to retrieve the relevant chunks, using the same model and cosine similarity, this chunk is ranked 22nd with a score of 0.732305… The first item has a score of 0.780513…
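For reference, the ranking step described here boils down to something like the sketch below. It is only a minimal illustration with toy 3-dimensional vectors standing in for real embeddings, not the actual LangChain retrieval code; `cosine_similarity` and `rank_chunks` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec, chunk_vecs):
    """Return (chunk_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption: real embeddings have far more dimensions).
query = [1.0, 0.0, 0.0]
chunks = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]
ranking = rank_chunks(query, chunks)
```

Note that every score quoted in this post falls between roughly 0.73 and 0.78, so in practice the ranking is decided by very small differences.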

However, providing the same question in another language moves this irrelevant chunk to the first position, with a score of 0.737243…, while the top item for the English version drops to 4th position (score 0.727817…).

Now I do understand that the matches can vary between languages, but I do not expect 21 somewhat relevant matches in English to become less relevant than --- when the same query is asked in a different language. Can someone shed some light on this? Why is --- not always at the bottom of the list, given that it contains no meaning?

I have the same problem. Did you solve it?

Cosine similarity is not very precise. In search, we measure information retrieval quality on two dimensions: precision and recall. Semantic search has great recall but poor precision. What that means is that you are generally guaranteed to get the most relevant result back in your result set, but the odds of it being towards the top of the list are low. Typically precision is around 60% out of the box with cosine similarity.

To get around the precision issue, state-of-the-art systems re-rank the results after they're returned using more traditional IR algorithms, but this yields at best 80% precision, which is still poor. The best approach is to just oversample: you need to show the model more chunks than you think you should need to show it.
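As a rough sketch of what oversampling plus re-ranking could look like: take more candidates than you need from the cosine ranking, then reorder them with a simple lexical score. Everything here is an assumption for illustration: `lexical_score` and `oversample_and_rerank` are hypothetical helpers, the chunk texts and scores are made up, and plain term overlap is only a stand-in for a real IR algorithm.

```python
import re

def tokenize(text):
    """Lowercase word tokens; punctuation-only text yields an empty list."""
    return re.findall(r"\w+", text.lower())

def lexical_score(query, chunk):
    """Fraction of distinct query terms that also appear in the chunk."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    return len(q_terms & set(tokenize(chunk))) / len(q_terms)

def oversample_and_rerank(query, candidates, final_k=3, oversample=4):
    """candidates: (chunk_text, cosine_score) pairs, sorted by cosine score descending.

    Takes the top final_k * oversample candidates, then reorders them by
    lexical score, using the cosine score only as a tie-breaker.
    """
    pool = candidates[: final_k * oversample]
    reranked = sorted(
        pool,
        key=lambda c: (lexical_score(query, c[0]), c[1]),
        reverse=True,
    )
    return reranked[:final_k]

# Made-up candidates, with the "---" chunk ranked first by cosine score.
candidates = [
    ("---", 0.737),
    ("The engine torque is 400 Nm", 0.731),
    ("Oil capacity: 5 litres", 0.729),
    ("Engine displacement is 2.0 litres", 0.728),
]
top = oversample_and_rerank("What is the engine torque?", candidates, final_k=2, oversample=2)
```

In this toy case the `---` chunk has no word tokens, so it gets a lexical score of 0 and drops out of the final two. Keep in mind that term overlap only helps when the query and the chunks are in the same language, so it would not fix the cross-language case on its own.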

Unfortunately, I do not have a solution yet for the raw chunks. There are some techniques to try to improve the chunks. For example, when chunking your documents, you can generate automated summaries for those chunks, or for sets of chunks. These summaries will probably eliminate the irrelevant data. There are also techniques that cluster chunks with similar data from different documents and summarize those clusters. Using both the summaries and the raw chunks might increase the relevant context for the answer. Perhaps with more relevant data available, the irrelevant chunks will be ranked lower.

I’m not a fan of pre-processing the data, as it could potentially remove relevant data.

Do you have any other ideas?