I face the same challenge dealing with very large knowledge base datasets. Like you, I sometimes end up with a good answer, but not the most comprehensive one. As mentioned earlier, the first thing you may want to look at is your embedding strategy.
I always use Semantic Chunking to make sure each chunk is as relevant as possible to its context: https://youtu.be/w_veb816Asg
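Conceptually, that boils down to something like this (a rough sketch in Python; `embed()` is a stand-in for whatever embedding call you use, and the similarity threshold is something you tune per corpus):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, embed, threshold=0.75):
    """Split where the similarity between adjacent sentences drops, i.e. at topic shifts."""
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:  # topic shift -> close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```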
Next, I use metadata as much as possible to identify where in the overall document each embedded chunk belongs:
I have an object property called “Questions”, which lists the questions that this particular chunk answers. This has proven to raise the SNR (signal-to-noise ratio) significantly.
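For illustration, the shape of that schema in Weaviate looks roughly like this (v3 Python client syntax; the class and property names are examples, not my exact schema):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

client.schema.create_class({
    "class": "DocChunk",
    "vectorizer": "text2vec-openai",        # Weaviate calls OpenAI for the embeddings
    "properties": [
        {"name": "content",   "dataType": ["text"]},    # the chunk itself
        {"name": "source",    "dataType": ["text"]},    # which document it came from
        {"name": "section",   "dataType": ["text"]},    # where in the document it sits
        {"name": "questions", "dataType": ["text[]"]},  # questions this chunk answers
    ],
})
```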
As for the embeddings themselves, I use text-embedding-ada-002, which produces vectors with 1536 dimensions.
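For reference, a direct ada-002 call with the current openai Python client looks like this (illustrative only; as noted next, Weaviate’s vectorizer module makes this call for me):

```python
from openai import OpenAI

oai = OpenAI()  # separate from the Weaviate client above
resp = oai.embeddings.create(model="text-embedding-ada-002", input="chunk text here")
vector = resp.data[0].embedding
print(len(vector))  # 1536
```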
I can’t speak to using multiple embedding models as I’ve not done this, but I use Weaviate as my vector store, and it has a vectorizer module, text2vec-openai, which has been working very well for me for several months now.
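Continuing the schema sketch above, a retrieval then looks roughly like this; with text2vec-openai configured, the query text is embedded automatically, so there is no manual ada-002 call:

```python
results = (
    client.query
    .get("DocChunk", ["content", "source", "section", "questions"])
    .with_near_text({"concepts": ["overtime rules for weekend shifts"]})  # example query
    .with_limit(50)
    .do()
)
chunks = results["data"]["Get"]["DocChunk"]
```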
Now, all that said, even with some 50 results, I still do not always get the most comprehensive answers with certain document sets (particularly the labor agreements and religious sermons). I have found that increasing the document retrieval limit much past 50 does not always yield the anticipated results. The model, gpt-4-turbo in this case, may have a much larger input context window, but the more text you put into it, the less effective it is at actually using that text. See: Anthropic's best Claude 2.1 feature suffers the same fate as GPT-4 Turbo
So, I’ve come up with my own strategy, which I call “Deep Dive”. It is similar to this approach, where LLMs are used to extract more comprehensive results: Biggest difficulty in developing LLM apps - #41 by plasmatoid
But mine is even simpler: I run the cosine similarity search and return the top 50-100 retrievals as discussed earlier. Instead of passing those straight to the primary LLM to evaluate, I use a cheaper model (with a smaller context window) to evaluate the results in batches of 10-15 and give me the most relevant matches. Once this LLM has evaluated all the results, I send its output, along with the most relevant documents, to the main LLM. From here, I do one of two things: either return the concatenated results of the secondary LLM, or have the primary LLM evaluate the top relevant documents. Not sure which one works best yet.
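To make that concrete, here is a rough sketch of the batched filtering pass (the model names, batch size, prompt wording, and parsing are illustrative, not my exact implementation):

```python
from openai import OpenAI

client = OpenAI()

def deep_dive(question, retrieved_chunks, batch_size=10):
    """Filter 50-100 retrieved chunks with a cheap model, then answer with the primary one."""
    shortlisted = []
    for i in range(0, len(retrieved_chunks), batch_size):
        batch = retrieved_chunks[i:i + batch_size]
        numbered = "\n\n".join(f"[{j}] {c}" for j, c in enumerate(batch))
        # Cheap, smaller-context model picks the passages that actually answer the question.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {question}\n\nPassages:\n{numbered}\n\n"
                    "Reply with the numbers of the passages relevant to the question, comma-separated."
                ),
            }],
        )
        picks = resp.choices[0].message.content
        shortlisted += [
            batch[int(n)]
            for n in picks.replace(" ", "").split(",")
            if n.isdigit() and int(n) < len(batch)
        ]

    # Final pass: the primary model answers using only the shortlisted chunks.
    context = "\n\n".join(shortlisted)
    final = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": f"Using only this context:\n{context}\n\nAnswer: {question}"}],
    )
    return final.choices[0].message.content
```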
Anyway, my two cents.