HyDE with hybrid search approaches

Hello all,

I have a system that does question answering with hybrid search (keyword search and vector search). Now I want to integrate HyDE to see what difference it makes to the generated answers.
Currently, my system works as follows:

  1. User asks a question.
  2. ChatGPT converts the question into search terms.
  3. Perform keyword search on my database with the search terms (Elasticsearch in this case).
  4. Perform semantic search on Elasticsearch with the search terms.
  5. Use RRF (reciprocal rank fusion) to normalize, combine, and rerank the documents.
  6. Select as many of the top-ranked documents as fit into a specified token budget, send the question to ChatGPT along with the selected documents, and get the answer.
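
Step 6 above can be sketched roughly as follows. This is only an illustration, not the actual implementation: `approx_token_count` is a hypothetical stand-in that counts whitespace-separated words, and a real system would use the tokenizer that matches its model.

```python
from typing import Callable, List


def approx_token_count(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real model tokens.
    return len(text.split())


def select_within_budget(
    ranked_docs: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Take documents in rank order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for doc in ranked_docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            break  # stop at the first document that does not fit
        selected.append(doc)
        used += cost
    return selected


if __name__ == "__main__":
    docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
    print(select_within_budget(docs, budget=5))
```

Stopping at the first document that does not fit keeps the selection strictly in rank order; skipping it and trying lower-ranked, shorter documents is another possible choice.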

After reading some example implementations, I am having trouble integrating HyDE into my overall process. In particular, where should I actually perform the semantic search with the hypothetical answer? I have two approaches in mind but am not sure whether either is right.

  1. Should I do it during the semantic search step of my current process? That is, use the search terms for keyword search on the whole database, use the hypothetical answer for semantic search on the whole database, and then rerank the documents to complete the hybrid search?
  2. Should I do it on the results retrieved by keyword search? I think the downside is that the keyword search results are then prioritized, but the examples I found always do this (e.g. getting results from a news API on the OpenAI site).

Can anyone help me? What is the best way to integrate hybrid search with HyDE? Thanks.

Maybe do a 3-leg RRF.

Here would be a potent combo:

  1. Normal embedding leg.
  2. Keyword leg.
  3. HyDE leg: question -> HyDE answer -> correlate with embeddings, keywords, or both -> additional rankings for RRF, depending on how many legs this turns into.

Hi. Yes, I am considering this as well, but I fear the overall system might become less efficient if I do a lot of reranking.

@curt.kennedy regarding the HyDE leg, is there a big difference if I do:

  1. Do keyword search.
  2. Do semantic search.
  3. Fuse with RRF.
  4. Correlate the ranked results with the HyDE answer by reranking again.

vs this one?

  1. Do keyword search.
  2. Do semantic search.
  3. Correlate the keyword results with the HyDE answer.
  4. Correlate the semantic results with the HyDE answer.
  5. Rerank using RRF.
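
To make "correlate with the HyDE answer" concrete, here is a minimal sketch that assumes it means reordering an already retrieved candidate list by cosine similarity to the embedding of the HyDE answer. The function names and the tiny 2-dimensional vectors are hypothetical; in practice the vectors would come from whatever embedding model the system already uses.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_by_hyde(
    candidate_ids: List[str],
    doc_vectors: Dict[str, Sequence[float]],
    hyde_vector: Sequence[float],
) -> List[str]:
    """Reorder retrieved candidates by similarity to the HyDE answer embedding."""
    return sorted(
        candidate_ids,
        key=lambda doc_id: cosine(doc_vectors[doc_id], hyde_vector),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy embeddings, purely illustrative.
    vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(rerank_by_hyde(["a", "b", "c"], vectors, hyde_vector=[0.0, 1.0]))
```

Either ordering in the question uses this same operation; the difference is whether it is applied once to the fused list or separately to each result list before fusion.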

I wasn’t thinking of correlating previous results with the HyDE answer.

I was thinking you get rankings straight from the original text by using HyDE, semantics, and keywords. So this is (3) streams of rankings, which you then fuse into one ranking using RRF.

The only nuance is that HyDE could be considered some sort of new query (it produces a synthetic query, the hypothetical answer, from the original one). So with this you could do a 2-leg RRF: one leg with semantics on the HyDE-generated query, one with keywords on the HyDE-generated query.

So putting all this together, you have (4) streams (max) to fuse in RRF.

  1. Semantic on original query
  2. Keywords on original query
  3. Semantic on HyDE generated query
  4. Keywords on HyDE generated query

All of these can be run in parallel, and have no dependencies between them. So do this, and when the last one finishes, fuse them all to a single ranking using RRF.
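
A minimal sketch of that idea, under stated assumptions: each leg is represented as a hypothetical zero-argument callable that returns a ranked list of document IDs (in a real system these would wrap the Elasticsearch keyword and vector queries), the legs run in a thread pool, and the results are fused with the standard RRF score, the sum over legs of 1 / (k + rank). The constant `k = 60` is a commonly used default, not something fixed by this thread.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(doc) = sum over legs of 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_search(legs: List[Callable[[], List[str]]], k: int = 60) -> List[str]:
    """Run every leg concurrently, wait for all of them, then fuse once."""
    with ThreadPoolExecutor(max_workers=len(legs)) as pool:
        futures = [pool.submit(leg) for leg in legs]
        rankings = [future.result() for future in futures]
    return rrf_fuse(rankings, k)


if __name__ == "__main__":
    # Hypothetical canned results standing in for the four search legs.
    legs = [
        lambda: ["d1", "d2", "d3"],  # semantic on original query
        lambda: ["d2", "d1", "d4"],  # keywords on original query
        lambda: ["d2", "d3", "d1"],  # semantic on HyDE-generated query
        lambda: ["d4", "d2", "d1"],  # keywords on HyDE-generated query
    ]
    print(hybrid_search(legs))
```

Because RRF only looks at ranks, the legs do not need comparable scores, and adding or dropping a leg only changes the list passed to `hybrid_search`.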

This is different from what you are describing above, because I am not correlating the keyword or semantic results with anything from HyDE. I am treating each leg as an independent processing stream, which helps keep the latency of the overall search low.