Essentially, you want to maximize the amount of relevant information you retrieve, right?
Imagine you’ve embedded a huge number of documents. Some are revisions of earlier documents, some are summary documents, etc.
Let’s say the ideal retrieval ultimately requires combining three different sets of facts or ideas; call them \mathcal{A}, \mathcal{B}, and \mathcal{C}.
You might have many documents that relate to each of these ideas separately, and even several that relate to \mathcal{A} \cap \mathcal{B}, but no documents that relate to all three ideas.
It’s likely that your top retrieval results are all very similar: they all relate to the same combination, \mathcal{A} \cap \mathcal{B}, because that’s the closest match to your query, covering two of the three disparate ideas.
It’s very possible that retrieval in this case completely misses pulling in any specific information about idea \mathcal{C}.
So, what I propose is something that comes from my field, experimental design: orthogonal subsampling.
You want to pull in resources that are a very high match for the desired generation, but which are as different from each other as possible.
Here's a toy example from experimental design.
Say you have a machine that makes widgets.
- The machine has two dials (call them x_1 and x_2), each of which can be set continuously between 0 and 1.
- The widgets can be objectively measured for quality on a scale of 0–1.
- You have a set number (n) of tests you can perform before you have to do a production run of, say, 1,000,000 widgets.
Your goal is to figure out the combination of dial settings that produces the best widgets.
I’m not going to go too deeply into the theory and finer details here, but if the number of runs was n = 4, without any other information, the runs you’d want to do are the four corners of the design space:

- (x_1, x_2) = (0, 0)
- (x_1, x_2) = (0, 1)
- (x_1, x_2) = (1, 0)
- (x_1, x_2) = (1, 1)
The gist is that this design encloses the most area possible in the design space, its (centered) column vectors are orthogonal to each other, and the measured points are maximally distant from one another.
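As a quick sanity check, here's a small numpy snippet verifying those properties for the four corner runs; it's purely illustrative:

```python
import numpy as np

# The four corner runs of the 2x2 design above.
X = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 1.0],
])

# Center the columns by subtracting each dial's mean setting (0.5).
Xc = X - X.mean(axis=0)

# Orthogonality: the centered columns have zero dot product.
print(Xc[:, 0] @ Xc[:, 1])  # 0.0

# Spread: the smallest pairwise distance between runs is 1.0, the best
# achievable for four points in the unit square. (Area is trivially
# maximal too: the corners enclose the whole square.)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(dists[np.triu_indices(4, k=1)].min())  # 1.0
```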
We can apply these principles of experimental design to sampling by choosing a sample that is as close as possible to what we would have designed.
Turning back to the idea of embeddings before I stray too far afield…
If you were planning on injecting two embeddings into context, you might have three results with cosine-similarity scores to your query of, say, 0.92, 0.90, and 0.86. At first glance the obvious choice is to include the first two, since they're the closest matches to what you're looking for. But if the cosine-similarity between the first two is, say, 0.99, they are essentially the same embedding, and you gain nothing from including them both. If, on the other hand, the cosine-similarity between the first and third is, again let's just say, 0.63, then we can be more confident that the information in the third is different from the information in the first (or second).
So, under the assumption that your embeddings are all (roughly) the same number of tokens, we add more total information to the context by including the first and the third than we would by including the first and the second. That's something that wouldn't be picked up simply by ranking on query similarity.
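Here's a minimal sketch of one way to operationalize that, close in spirit to maximal marginal relevance (MMR): greedily pick the result that best trades query relevance against redundancy with what's already selected. Everything here is illustrative, down to the names and the `lam` weight; assume the query and pairwise cosine similarities have already been computed.

```python
import numpy as np

def diverse_select(query_sims, doc_sims, k, lam=0.7):
    """Greedy, MMR-style selection trading relevance against redundancy.

    query_sims[i] is the cosine similarity of result i to the query;
    doc_sims[i, j] is the cosine similarity between results i and j;
    lam weights relevance vs. novelty. All names are illustrative.
    """
    selected = [int(np.argmax(query_sims))]  # start with the top hit
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(len(query_sims)):
            if i in selected:
                continue
            # Penalize by the worst overlap with anything already chosen.
            redundancy = max(doc_sims[i, j] for j in selected)
            score = lam * query_sims[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# The three-result example from above: results 1 and 2 are near-duplicates
# (0.99 to each other), result 3 is a weaker match but far more novel.
query_sims = np.array([0.92, 0.90, 0.86])
doc_sims = np.array([
    [1.00, 0.99, 0.63],
    [0.99, 1.00, 0.65],
    [0.63, 0.65, 1.00],
])
print(diverse_select(query_sims, doc_sims, k=2))  # [0, 2], first and third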
But you could have an even more complicated setup. Maybe your first ten results have cosine-similarity scores to your query ranging between 0.92 and 0.84, but they all have a cosine-similarity with each other above 0.90, while your eleventh retrieval scores 0.82 against the query with a maximum cosine-similarity of 0.65 to any of the top ten. You might want to go that far down the ranking to pick up this eleventh retrieval, on the idea that it might have “new” information not contained in the first ten.
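Running the same sketch on this ten-plus-one scenario (numbers invented to match the description), the redundancy penalty is exactly what reaches down and pulls in the eleventh result:

```python
import numpy as np  # reuses diverse_select from the sketch above

# Ten near-duplicate top hits (pairwise similarity > 0.90) plus an
# eleventh, weaker hit overlapping the others at most 0.65.
query_sims = np.array([0.92, 0.91, 0.90, 0.89, 0.88,
                       0.87, 0.86, 0.85, 0.85, 0.84, 0.82])
doc_sims = np.full((11, 11), 0.92)  # the top ten all look alike
doc_sims[10, :] = 0.65              # result 11 is the outlier
doc_sims[:, 10] = 0.65
np.fill_diagonal(doc_sims, 1.0)

print(diverse_select(query_sims, doc_sims, k=2))  # [0, 10]
```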
Incidentally, this is one of the problems with the embedding space not being truly filled, but rather existing in a relatively small cone: nothing is truly orthogonal, so we can't be absolutely certain whether, in this case, the eleventh retrieval has different information or just less information. In general, though, I think we should assume that as the cosine-similarity between two retrievals approaches the observed minimum, the information they contain becomes more different.
Apologies if that got a bit rambly; I promise I went back through and cut out huge swaths to try to make it more streamlined and readable.