TL;DR Here is a lot of math describing the coherent combination approach in a fault tolerant way. Hopefully the notation isn't too confusing; it's there to keep things concise.
Yes, in my implementation, each chunk has N embedding vectors associated with it, and one Bag of Words (BoW) mapping for the keyword leg.
All the semantic legs can be combined coherently, which isn’t often talked about, as I see most people combining non-coherently with RRF or RSF. And there is some nuance if you want to combine “coherently” with the keywords, which I will explain below.
So on just the semantic side with embedding vectors, and N models, here is what you do:
Input text comes in, embed this with the N engines, in parallel. Get out N vectors of varying dimensions. The idea here is with coherent combination, you are synthesizing a massive embedding vector, and using this virtual high dimensional vector as your embedding.
For example, suppose you are using 10 embedding models, five (5) of them are of dimension 1024. Three (3) of them are of dimension 512, one (1) is of dimension 3072, and one (1) is of dimension 1536. So here is how you coherently combine these, in a fault tolerant way, that also synthesizes a vector of dimension 11,264, which contains more information than any individual model. And if some subset of embedding models go down, you get a synthesized vector of fewer dimensions, but it is still combined coherently.
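Here is a rough sketch of the embed-in-parallel step, assuming a list of per-model embedding callables `embed_fns` (the names, the thread pool, and the timeout are all my illustration choices, not a specific implementation):

```python
# Rough sketch of the embed-in-parallel step.  `embed_fns` is an assumed list
# of callables, one per embedding model, each returning a vector for the text
# (dimensions can differ per model).  A model that errors or times out simply
# drops out of the result, which the re-weighting described below handles.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def embed_all(text, embed_fns, timeout_s=5.0):
    """Embed `text` with every model in parallel; return {model_index: unit vector}."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(embed_fns)) as pool:
        futures = {i: pool.submit(fn, text) for i, fn in enumerate(embed_fns)}
        for i, fut in futures.items():
            try:
                v = np.asarray(fut.result(timeout=timeout_s), dtype=np.float32)
                results[i] = v / np.linalg.norm(v)   # enforce unit length
            except Exception:
                pass                                 # model i is "down" for this query
    return results
```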
So for some notation, the input chunk X is mapped to 10 different vectors, E_1^X, E_2^X, ..., E_{10}^X.
You then take each of these E_i^X vectors, and correlate them to your target embeddings in your knowledge store. Let \Omega_i represent the collection of knowledge store vectors for embedding model i. Also, to give the final correlation some breathing room, form the set of top K correlations, say 10 \le K \le 50 might be a good place to start.
So for each model i, you form these correlations against its store and keep the top K of them:
C_{i,K} = Top_K( \{ E_i^X \cdot y \mid y \in \Omega_i \} )
Each of these correlations should be a number between -1 and 1, assuming the embedding model outputs unit vectors, which is standard for embedding models these days, but know your models and pre-scale to unit length if a particular one doesn't. Also save off the text behind each y here, since this could be used in your RAG prompt. So you will get K chunks of text from your knowledge store from each model.
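Here is a minimal sketch of that per-model top-K step, assuming `store_vecs` is a (num_chunks x dim) matrix of unit-normalized knowledge-store embeddings for one model and `store_texts` is the parallel list of chunk texts behind them (both names are mine):

```python
# Sketch of the per-model top-K correlation step against one model's store.
import numpy as np

def top_k_matches(query_vec, store_vecs, store_texts, k=20):
    """Return the top-K (chunk_index, cosine, text) triples for one model."""
    sims = store_vecs @ query_vec            # unit vectors, so dot = cosine in [-1, 1]
    top_idx = np.argsort(sims)[::-1][:k]     # indices of the K largest correlations
    return [(int(j), float(sims[j]), store_texts[j]) for j in top_idx]
```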
OK, great. Now that you have the top correlations against your knowledge store for each model, you combine these using a weighted average. Your weighted-average coefficients are:
\alpha_i, where \sum_{i=1}^N \alpha_i = 1
You would adjust the values of \alpha to weight certain models over the others, or you could weight them all equally by setting \alpha_i = 1/N.
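For instance, equal weighting is just:

```python
# Equal weighting across N models; any hand-tuned alphas just need to sum to 1.
N = 10
alphas = {i: 1.0 / N for i in range(N)}
assert abs(sum(alphas.values()) - 1.0) < 1e-9
```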
OK, cool, so now your coherent combination, for each candidate chunk d that shows up in any model's top-K list, is:
\rho(d) = \sum_{i=1}^N \alpha_i (E_i^X \cdot E_i^d)
where E_i^d is model i's stored embedding of chunk d (the corresponding y \in \Omega_i). You will have up to N*K of these candidates (fewer when the models' top-K lists overlap). So with 10 models and a breathing room factor of K=20, that is up to 200.
You then take the top 5, or top 1, whichever you want to allow into your prompt, as the final downselection.
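Here is a minimal sketch of that combination under the notation above. Assumptions on my part: `top_k[i]` is a dict {chunk_id: cosine} holding model i's top-K hits, `query_vecs[i]` is E_i^X for whichever models are up, `alphas[i]` are the (already re-normalized) weights, and `stored_vec(i, chunk_id)` is a hypothetical lookup of E_i^d for candidates that didn't make model i's own top-K.

```python
# Sketch of the coherent combination: score every chunk that appears in any
# model's top-K list with the weighted sum of that chunk's per-model
# correlations (the rho(d) above).
def coherent_scores(query_vecs, top_k, alphas, stored_vec):
    """Return {chunk_id: rho} over the union of all models' top-K candidates."""
    candidate_ids = set().union(*(hits.keys() for hits in top_k.values()))
    scores = {}
    for cid in candidate_ids:
        rho = 0.0
        for i, q in query_vecs.items():        # only the models that are up
            sim = top_k[i].get(cid)            # reuse the cached cosine if we have it
            if sim is None:                    # otherwise score the stored vector directly
                sim = float(q @ stored_vec(i, cid))
            rho += alphas[i] * sim
        scores[cid] = rho
    return scores

# Final downselection into the prompt, e.g. the top 5:
# best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
```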
Alright … but what if one of the models goes down?
So, this is easy to handle. For the case of one model down, say model 3 is down, you set \alpha_3 = 0, but you have to redistribute this weight across the other weights, so you multiply each correlation by 1/(1-\alpha_3) to restore the weighted sum back to one. If you have two models out, say model 3 and model 9, you multiply by 1/(1-\alpha_3-\alpha_9).
So you would multiply the combined \rho values by the adjustment factor 1/(1 - \sum_{q \in D} \alpha_q), where D \subset \{1, 2, ..., N\} is the set of models currently down. If none are down, this evaluates to 1, and so nothing is adjusted.
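A minimal sketch of that re-weighting, assuming `alphas` is a dict {model_index: weight} and `up_models` is the set of models currently responding:

```python
# Renormalize the weights over the models that are actually up.  Dividing by
# the surviving total is the same as multiplying the combined score by
# 1 / (1 - sum of the down models' alphas).
def renormalize(alphas, up_models):
    live = {i: a for i, a in alphas.items() if i in up_models}
    total = sum(live.values())                 # equals 1 - sum(alpha_down)
    return {i: a / total for i, a in live.items()}

# Example: 10 equally weighted models with model 3 down
# alphas = {i: 0.1 for i in range(10)}
# renormalize(alphas, set(range(10)) - {3})    # each surviving weight becomes 1/9
```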
So now you can combine coherently across semantics, using multiple models in a fault tolerant way.
But what about combining with Keywords? These do not readily map to -1 to 1.
The traditional approach would be to combine all of your semantic models first, coherently like I am advocating above for the best performance, then non-coherently combine this with the keywords using RRF.
So for RRF, or Reciprocal Rank Fusion, you take the semantic ranking list, say with ranks being the integers 1, 2, 3, 4, 5, and the keyword ranking list, also with ranks 1, 2, 3, 4, 5.
You would then fuse these into one ranking by combining them reciprocally, same as a harmonic sum. So… equation:
RRF(d) = \sum_{i=1}^{M} \frac{1}{c + r_i(d)}
Here d is the document, or chunk, that you are trying to rank, the sum runs over the M ranking lists being fused (here M = 2: semantic and keyword), r_i(d) is d's rank in list i, so 1, 2, 3, ..., and c is a constant; I usually set it to 1, but I see some folks like setting it somewhere near 60.
This is how you would combine your semantics and keywords non-coherently. If you want to reduce your keywords' importance, set the numerator to somewhere between 0 and 1 for the keyword list. Similar to the \alpha_i weighting above, you could give each ranking list its own weight, so a generalized weighted RRF is:
RRF_{weighted}(d) = \sum_{i=1}^{M} \frac{\alpha_i}{c + r_i(d)}
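A short sketch of the fusion, assuming `rankings` maps each leg name (e.g. "semantic", "keyword") to its ranked list of chunk ids, best first, and `weights` holds the per-leg numerators; setting every weight to 1 recovers plain RRF:

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=1):
    """Fuse ranked lists of chunk ids; c=1 is my preference, c=60 is also common."""
    scores = defaultdict(float)
    for leg, ranked_ids in rankings.items():
        w = weights.get(leg, 1.0)
        for rank, cid in enumerate(ranked_ids, start=1):   # ranks are 1, 2, 3, ...
            scores[cid] += w / (c + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g. weighted_rrf({"semantic": sem_ids, "keyword": kw_ids},
#                   weights={"semantic": 1.0, "keyword": 0.5})
```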
RSF (Relative Score Fusion), mentioned above, is another non-coherent technique, but I think it really only applies to non-coherent combination across semantic searches. More here on RSF and RRF.
Finally, what about a pseudo-coherent combination of semantics and keywords? Well, I have been thinking about this one recently too. It would involve mapping the keyword correlations through the \tanh function (hyperbolic tangent), so anything maps to an output range of -1 to 1, similar to the embeddings, and then you can coherently combine this as another pseudo embedding leg, without resorting to RRF.
Here I would fit a bias and a scale factor on the keyword correlation strength against your coherent combinations. I haven't done this one yet, but the idea is to do some sort of least-squares fit to solve for the bias and scale factor so that your keyword correlation strength sits in the linear part of \tanh and correlates well with the coherent combinations from your ensemble of embedding models.
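Since I haven't built this yet, treat the following as nothing more than a sketch of what that fit could look like: solve for a scale a and bias b so that \tanh(a \cdot s + b) tracks the coherent semantic score \rho over a set of calibration pairs. The function names and the use of SciPy's curve_fit are my assumptions.

```python
# Purely exploratory sketch: fit tanh(a*s + b) ~= rho over calibration pairs of
# (keyword correlation strength s, coherent semantic score rho).
import numpy as np
from scipy.optimize import curve_fit

def fit_keyword_mapping(keyword_scores, rho_targets):
    """Least-squares fit of the scale a and bias b; returns (a, b)."""
    model = lambda s, a, b: np.tanh(a * s + b)
    (a, b), _ = curve_fit(model,
                          np.asarray(keyword_scores, dtype=float),
                          np.asarray(rho_targets, dtype=float),
                          p0=(1.0, 0.0))
    return a, b

# At query time the keyword leg would then contribute tanh(a*s + b) as one more
# pseudo-embedding correlation, weighted by its own alpha like the other legs.
```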