We just deployed the latest version of discourse-ai here, which enables HyDE-based semantic search.
What is HyDE?
As explained by our Discourse GPT-4 bot, which has access to Google Search:
Searched Google, found 674,000 results for 'HyDE based semantic search explanation'
HyDE, which stands for Hypothetical Document Embeddings, is a technique used in semantic search to find documents based on similarities in semantic embedding. It’s a zero-shot learning technique, meaning it can make predictions about data it has not been trained on.
In the context of search, HyDE works by generating a hypothetical answer to a query using a language model. This hypothetical answer is then embedded into a vector space, similar to how real documents are embedded. When a search query comes in, similar real documents are retrieved based on vector similarity to the hypothetical document. This allows for a more precise and relevant retrieval of documents, even when the exact terms used in the search query may not be present in the documents.
The aim of HyDE is to improve the quality of search results by focusing on the underlying intent of the search query, rather than just the exact words used. This makes it particularly useful for tasks like question-answering, where the goal is to find the most relevant information to answer a user’s question, rather than just finding documents that contain the exact words used in the question[1][2][3].
“Dense retrieval, a technique for finding documents based on similarities in semantic embedding, has been shown effective for tasks including…”[1:1]
“Given a query, HyDE first zero-shot instructs an… where similar real documents are retrieved based on vector similarity.”[2:1]
“This way, when searching, matches can be made based on the underlying intent… The HyDE hypothesis is that the document search would yield better results…”[3:1]
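To make that concrete, here is a minimal Python sketch of the HyDE idea, assuming the OpenAI API for both steps; the model choices, prompt, and helper names are illustrative, not the discourse-ai implementation:

```python
# HyDE in miniature: embed a hallucinated answer rather than the raw query.
# Assumes the openai package (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def hyde_embedding(query: str) -> list[float]:
    # 1. Zero-shot instruct a language model to write a plausible (possibly
    #    inaccurate) answer to the query -- the "hypothetical document".
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Write a short forum-style answer to the question."},
            {"role": "user", "content": query},
        ],
    )
    hypothetical_answer = completion.choices[0].message.content

    # 2. Embed the hypothetical answer; real documents embedded with the same
    #    model can then be ranked by vector similarity to it.
    embedding = client.embeddings.create(
        model="text-embedding-ada-002",
        input=hypothetical_answer,
    )
    return embedding.data[0].embedding
```

Because the comparison happens between the hypothetical answer and real documents, matches depend on the intent of the query rather than its exact wording.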
How is it implemented here?
When you perform a full page search such as:
How do I count tokens in function calls effectively
- We perform the normal keyword-based search
- In the background, we make a call to GPT-3.5 to hallucinate an answer
- Once the answer is hallucinated, we embed it using text-embedding-ada-002
- We perform a vector similarity search using pgvector
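Put together, those steps look roughly like the sketch below. The table and column names are assumptions for illustration only; discourse-ai's actual implementation (in Ruby) and schema differ:

```python
# Rough sketch of the semantic-search leg described above, assuming the openai
# and psycopg2 packages, pgvector installed in Postgres, and a hypothetical
# topic_embeddings table with a vector "embedding" column.
import psycopg2
from openai import OpenAI

client = OpenAI()

def semantic_search(query: str, conn, limit: int = 4):
    # Hallucinate an answer with GPT-3.5 ...
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

    # ... embed the hallucinated answer with text-embedding-ada-002 ...
    vector = client.embeddings.create(
        model="text-embedding-ada-002",
        input=answer,
    ).data[0].embedding

    # ... then run a nearest-neighbour lookup with pgvector's cosine-distance
    # operator (<=>) against embeddings of real topics.
    vector_literal = "[" + ",".join(str(v) for v in vector) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT topic_id, title FROM topic_embeddings "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vector_literal, limit),
        )
        return cur.fetchall()

conn = psycopg2.connect("dbname=discourse")
print(semantic_search("How do I count tokens in function calls effectively", conn))
```

The keyword results come back immediately, while the semantic results arrive once the hallucination, embedding, and vector lookup finish in the background.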
How good are the results?
It really depends on the query: the more complex and advanced the query, the higher the odds that semantic search will give you more interesting results.
For example for:
How do I count tokens in function calls effectively
The results for traditional search in this case are:
- Function call limit count - #5 by Foxabilo
- Tattoos & Coding With ChatGPT
- My most important function is being called only very rarely - #7 by _j
- ChaosGPT: An AI That Seeks to Destroy Humanity - #34 by curt.kennedy
Semantic search, on the other hand, gives us far better results:
- What is the OpenAI algorithm to calculate tokens?
- Counting tokens for chat API calls (gpt-3.5-turbo)
- How to calculate the tokens when using function call
- Is Tokenizer.from_pretrained("gpt2") the same tokenizer used in your GPT3 and ChatGPT models?
Semantic search is orders of magnitude better than keyword search for this example.
Feedback
Let us know what you think; the AI team at Discourse is listening!
Big thanks to @Falco and @roman.rizzi for building the feature