How to improve semantic search accuracy?

Hi there,

I am facing a bit of a challenge trying to build a proper semantic search program. Long story short, I am still getting lots of “false positives” (basically I am retrieving pieces of text that are not directly related to my search).

Here is some background on the project. I have a database of semantic information of ~50k words. The database contains mildly complex text related to data privacy (not as complex as law & legal documentation, but it can be a bit technical or specific at times).

I am trying to perform asymmetric search. My query is a question, and it is supposed to retrieve any relevant information that could help answer that question.

I was hoping that using a pretty solid embedding model such as ada-002 would fix the issue, but I am still getting pretty bad semantic search results about 20% of the time.

I tried a few things, like combining two searches (one using similarity search, the other using MMR) and keeping the best of the two result sets. I have also tried creating “semantic clones” of my question to enhance the semantic matching (basically asking the same question with 3 different wordings to make sure its topic / meaning is well captured). I have also tried refining the semantic search results with a GPT-4 query, using clear prompt instructions to isolate what is relevant and what’s not…
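For reference, the "combine similarity search and MMR" part looks roughly like this. This is a simplified sketch with plain numpy over precomputed embedding vectors (not the actual production code); the function names and the union strategy are just illustrative:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    # Maximal Marginal Relevance: greedily pick documents that are
    # relevant to the query but not redundant with already-picked ones.
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            relevance = cosine_sim(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine_sim(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

def merged_search(query_vec, doc_vecs, k=4):
    # Run plain similarity ranking and MMR, then union the two result
    # lists (similarity hits first) and keep the top k.
    sims = [cosine_sim(query_vec, v) for v in doc_vecs]
    top_sim = sorted(range(len(doc_vecs)), key=lambda i: -sims[i])[:k]
    top_mmr = mmr(query_vec, doc_vecs, k=k)
    return list(dict.fromkeys(top_sim + top_mmr))[:k]
```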

But overall, my problem hasn’t been fixed by those techniques.

I have heard a few things about additional methods, such as combining several embedding functions together, using dynamic k_fetch & k selection for MMR, or possibly re-assessing cosine distances over several iterations… Not sure if those are best practices or if I am missing something here.
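From what I understand, "combining several embedding functions" could mean concatenating the L2-normalised vectors from two models, so that cosine similarity on the combined vector is the average of the per-model similarities. A minimal sketch of that idea, assuming both models just return plain vectors:

```python
import numpy as np

def combined_embedding(vec_a, vec_b):
    # Concatenate two L2-normalised embeddings. The 1/sqrt(2) factor keeps
    # the result unit-length, so cosine similarity between two combined
    # vectors equals the average of the two per-model cosine similarities.
    a = np.asarray(vec_a) / np.linalg.norm(vec_a)
    b = np.asarray(vec_b) / np.linalg.norm(vec_b)
    return np.concatenate([a, b]) / np.sqrt(2)
```

Whether this actually beats a single strong model depends on how complementary the two embedding spaces are, so it would need to be evaluated on real queries.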

To conclude, I was really hoping that ada-002’s quality would be the solution, but it seems that it does not fix everything.

What would be your recommendation at this point?

Thanks in advance for the help.

Hi and welcome to the Developer Forum!

Some examples would be useful. If you are asking GPT-4 for its best search string given a particular user’s enquiry and you are still receiving low-quality results, then it is very likely down to either low-quality data being stored, or the data being stored in a less than ideal way.

I suspect a combination of things would help: first, cleaning the source material, making sure you remove spurious entries like page formatting and other unrelated, non-domain-specific information; and then using embedding overlap, where part of the prior chunk and part of the next chunk are included with the current chunk to enable cross-chunk semantically valid retrieval.

Overlapping 25% of the prior chunk and 25% of the next chunk with 50% of the current chunk on a sliding-window basis will ensure that any topic which crosses a boundary is always contained within a single chunk. This should increase your results to the 90th-95th percentile, assuming high-quality, clean data in those chunks and a solid GPT-4 prompt to interpret a user’s query in the best way possible.
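As a rough sketch, that sliding window amounts to windows of a fixed size advanced by half a window each step, so consecutive chunks share 50% of their content (25% with each neighbour). Word-based splitting and the chunk size here are illustrative, not prescriptive:

```python
def chunk_with_overlap(words, chunk_size=400):
    # Sliding-window chunking: advance by half a window each step, so each
    # chunk shares ~25% with the previous chunk and ~25% with the next.
    step = chunk_size // 2
    chunks = []
    for start in range(0, max(len(words) - step, 1), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks
```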

Thanks a lot for the reply @Foxabilo

Definitely happy to look at the way the data is stored & do some cleaning if necessary.

Here is a first example: I get a decent match, but I also get extra information that seems close enough to the question yet is actually unrelated.

Question(s): What sort of sensitive data do you collect or store? what kind of confidential information do you gather or keep? do you hold or acquire any delicate personal details?

Semantic search (similarity): [(document(page_content=‘recommendation would be to avoid processing such data altogether. sensitive data: data consisting of racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, genetic data, biometric data, data’, metadata={}), 0.35815608501434326), (document(page_content=‘data. companies with >250 employees or where the processing is frequent or the processing applies to sensitive data categories are required to maintain such record of processing while there can be an exception for other companies. processing’, metadata={}), 0.36699146032333374), (document(page_content=‘children data: to operate on children data, collecting a consent for from parent or legal guardian is mandatory. sensitive data: processing sensitive data is prohibited by default and can happen only under specific circumstances, one general’, metadata={}), 0.37631654739379883), (document(page_content=‘contain in particular their company details, the purposes of processing, categories of data, recipients to whom personal data are disclosed, transfers of personal data to a third country, time limits for erasure of different categories of data, and’, metadata={}), 0.38764485716819763)]

Refined extract (after querying GPT4 for cleaning): Sensitive data can include details such as racial or ethnic origin political opinions religious or philosophical beliefs trade union membership genetic data biometric data. Processing such data is prohibited by default unless under specific circumstances. For companies with more than 250 employees or where the data processing is frequent or involves sensitive data categories maintaining a record of such processing is mandatory. Additionally operating on children’s data requires the consent of a parent or legal guardian.

In that example the part regarding sensitive data is correct, but the parts regarding the 250 employees and children data are out of scope.

Here is another example where I get a completely wrong match, clearly out of topic.

Question(s): Do all our third-party integrations support the gdpr’s data portability requirement, and how can we technically facilitate that? are all integrations with third parties in compliance with gdpr mandates on data portability, and what technical means can ensure this? is every third-party tool we integrate with adhering to the data portability specifications of the gdpr, and how can this be technically enabled?

Semantic search (similarity): [(document(page_content=‘portability. salesforce: salesforce prioritizes gdpr compliance by processing data on behalf of organizations, ensuring that personal data is not utilized for its own business purposes. with strict internal measures, salesforce commits to top-tier’, metadata={}), 0.2967739701271057), (document(page_content=‘to maintain gdpr compliance. specifically, notion has a data processing addendum in place for situations where notion processes any personal data subject to the gdpr. notion offers data portability and management tools that allow users to import’, metadata={}), 0.29792970418930054), (document(page_content=‘to be gdpr-compliant, the responsibility to utilize these tools correctly and maintain gdpr compliance lies with the user. for instance, understanding and defining the basis for data collection (e.g., consent or legitimate interest) is essential.’, metadata={}), 0.3150405287742615), (document(page_content=‘guarantees similar privacy & security principles as gdpr (see data transfer) data back-up: ensure that data backed-up followed the same rules as for gdpr, especially in terms of processing, storage, location and other general requirements.’, metadata={}), 0.32765448093414307)]

Refined extract (after querying GPT4 for cleaning): Salesforce and notion are two platforms that comply with gdpr’s data portability requirements. Salesforce accomplishes this by processing data on behalf of organizations while notion provides tools for data portability and management. However it’s crucial to note that the onus of using these tools correctly and maintaining gdpr compliance lies with the user. This includes understanding and defining the basis for data collection. Furthermore data backups must follow the same rules as for gdpr particularly around processing storage location and other general requirements.

In that case, everything is a mismatch. And I was at least hoping to get the following paragraph from my DB: “Users have the right to ask for their data portability to be transferred to another company or be directly sent to them. The data portability must be provided in a machine-readable electronic format.”

And finally here is a more tricky one. Everything in the match is thematically related to the question, but a significant part of it is not entirely relevant.

Question(s): Where do you store your data? in which location is your data stored? in what country is the data stored?

Semantic search (similarity): [(document(page_content=‘customers to address gdpr rights like rectification, erasure, and access. data storage options include centers in frankfurt, germany, with backups in ireland, ensuring eu data remains within the region unless moved under specific gdpr-compliant’, metadata={}), 0.9658189415931702), (document(page_content=‘measures for updating the data that might lose in relevance over time, then the storage period doesn’t seem proportionate to the purpose of data storage storage location: data has to be stored in eu (european union) countries or in a country that’, metadata={}), 1.093562126159668), (document(page_content=“fullstory offers detailed documentation and functional enhancements to give businesses confidence in handling gdpr obligations. when data is stored, it’s processed within google cloud platform’s centers located both in the us and the eu, ensuring”, metadata={}), 1.1188565492630005), (document(page_content=‘your data flows and know when to collect consent to collect, process & store. mandatory gdpr documentation: a company should maintain records of data processing activities (processor & controller) as well as about its dataflow, the different 3rd’, metadata={}), 1.142439842224121)]

Refined text (after GPT4 query): Data is stored in centers in frankfurt germany with backups in ireland ensuring eu data remains within the region unless moved under specific gdpr-compliant measures. Additionally when data is stored it’s processed within google cloud platform’s centers located both in the us and the eu.

In that case everything is wrong, because the whole text refers to specific situations (that are technically related to the question’s theme). So not only would I like to avoid retrieving such text, but I would also have expected to get at least this part from my DB: “storage location: Data has to be stored in EU (European union) countries or in a country that guarantees similar privacy & security principles as GDPR (see data transfer)”

And in case that matters, here is my GPT-4 prompt (which I admit is fairly complex, but definitely necessary, as I have got my best results with it so far):


"Q&A - refine query2 - eng" : """ 
Context: We are talking about data privacy, data regulation & GDPR topics.
Question / Topic: {query}

Role: You are to act as a data privacy & GDPR expert, your job is to help identify relevant information & build knowledge.

Instruction1: Consider the Question / Topic from above and understand its main thematic
In the documents / context / information provided, select the piece of text that could help build knowledge on the question thematic.
Consider anything that could be informational to better understand the question and its thematic and that could help building knowledge related to the question. 
Keep the knowledge selection very short & specific. Be particularly careful by selecting information that is relevant and specific to the question / topic / thematic. Exclude elements that are unrelated or that don't help building knowledge around the question thematic.
Make sure to be very selective, this is crucial. I insist, be very selective.

Instruction2: In the documents / context / information provided, you might find some information about companies or technological solutions. 
Companies could be (but are not limited to) Meta, Google, Microsoft, salesforce, Pimeyes, Shopify, Zoho, Wordpress, Jotform, etc. This also includes (but is not limited to) technological solutions such as hotjar, surveymonkey, mixpanel, google analytics, etc. 
Never select information related to those companies nor mention those companies. Systematically disregard information related to a company. Make sure to strictly apply this rule. 
You can only ignore this guideline if the company or the technological solution is explicitly mentioned in the Question / Topic.

Instruction3: Don't try to answer the question directly, you are not the one to whom the question is addressed, you are here to help gaining knowledge.

Instruction4: Feel free to use your own knowledge to complete your answer in case you did not extract much information from the documents / context / information.

Instruction5: If there is no relevant information and you don't have any specific knowledge, don't try to make up something; just indicate you don't have any relevant information to provide.

Note: If the Question is about making a choice between different options, make sure to cover that part & to provide a recommendation. If you really are not able to provide a recommendation, then say that it is difficult to prioritize one option over another.

About your answer: Never use a wording that refers to the document or the context provided. 
For instance, do not use terms like "The provided context does not specifically mention..." or "The provided documents emphasize the importance". 
Just focus on the information itself. 
Strictly follow this guideline.

Format: Give me a text written in a simple & understandable format (make sure to keep all details). Be direct, go straight to the point, make it easy to read.
No additional sentence nor explanation required, make sure to follow this guidance.""" ,

To answer your different points:
-The whole document is a .txt file with only line breaks but nothing else in terms of formatting.
-The text is organized on a paragraph-per-paragraph basis. Each paragraph starts with a topic name and then a topic description (for instance “Google Analytics illegal in EU: …”).
-I might have messed up the chunking, as I only used 10% overlap and tried pretty small chunks (250 to 750 words). I’ll give it a try with 25% and maybe increase the size? (As a reminder, my database is a document of <50k words.)

Thanks a lot

You are asking the bot three questions at once, which will muddle the situation. Solution: ask one question at a time, or parse out each question and ask them one at a time.
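Parsing the questions out can be as simple as splitting on the question marks, so each one can be embedded and retrieved separately. A quick sketch (it assumes the compound query is question-mark delimited, which matches your examples):

```python
import re

def split_questions(text):
    # Split a compound query on "?" so each question can be embedded
    # and searched on its own; re-append the "?" to each piece.
    parts = re.split(r"\?\s*", text)
    return [p.strip() + "?" for p in parts if p.strip()]
```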

You have a lot of negatives like “not” and “don’t”. I would avoid these in your prompt.

The chunks returned look small, try increasing the chunk size.

If that doesn’t work I have a million other ideas that could help. But start with the basics and clean up these obvious problems first.


Sounds like a plan. Thanks a lot, I’ll try that and keep you posted.