Based on your original question, the users' questions are generic, which is typical, and they don't align well with your data, right?
So the solution is to use the LLM to steer the question into multiple aspects of your data, then run a search, retrieve the top-K chunks, and have the LLM answer from your data. A rough sketch of that loop is below.
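Here is a minimal sketch of the steer-then-retrieve-then-answer loop. The `llm()` and `vector_search()` helpers are placeholders (assumptions, not any particular library) for whatever model client and vector index you actually use:

```python
# Sketch of the steer -> search -> answer loop. llm() and vector_search()
# are hypothetical stand-ins for your model client and your vector index.

def llm(prompt: str) -> str:
    """Call your LLM of choice and return the completion text."""
    raise NotImplementedError

def vector_search(query: str, k: int = 5) -> list[str]:
    """Return the top-k chunks from your index for this query."""
    raise NotImplementedError

def answer(user_question: str, k: int = 5) -> str:
    # Steer the generic question into several aspects of YOUR data.
    steered = llm(
        "Rewrite this question as 3 search queries that target our data, "
        f"one per line:\n{user_question}"
    ).splitlines()

    # Retrieve top-k chunks for each steered query and deduplicate.
    chunks: list[str] = []
    for q in steered:
        for c in vector_search(q, k):
            if c not in chunks:
                chunks.append(c)

    # Answer strictly from the retrieved chunks.
    context = "\n---\n".join(chunks)
    return llm(
        f"Answer using only this context:\n{context}\n\nQuestion: {user_question}"
    )
```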
If there is a follow-up question, feed in the previous history and keep steering the follow-up. The history maintains context, and the steering keeps your answers tightly correlated with your data and hopefully on-message.
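A follow-up can reuse the same loop. One way (again just a sketch, reusing the hypothetical `llm()` and `vector_search()` placeholders above and assuming `history` is a list of prior question/answer turns) is to fold the history into the steering prompt:

```python
def answer_followup(followup: str, history: list[tuple[str, str]], k: int = 5) -> str:
    # Fold prior (question, answer) turns into the steering prompt so the
    # rewritten search queries keep the conversation's context.
    transcript = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    steered = llm(
        f"Given this conversation:\n{transcript}\n\n"
        "Rewrite the follow-up as 3 standalone search queries over our data, "
        f"one per line:\n{followup}"
    ).splitlines()

    chunks: list[str] = []
    for q in steered:
        chunks.extend(vector_search(q, k))

    context = "\n---\n".join(dict.fromkeys(chunks))  # dedupe, keep order
    return llm(
        f"Conversation so far:\n{transcript}\n\n"
        f"Answer the follow-up using only this context:\n{context}\n\n"
        f"Follow-up: {followup}"
    )
```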
The steering applies to any input from the user (question or otherwise): you transform it into the terms of your data.
Also, you want to quality-check your answer against previous history.
Every muffin has a top and a bottom. The top of the muffin consists of all the background projections (steerings) into your data. (This is the Hydra, the many-headed beast from Greek mythology.)
Then there is the RAG part, which covers final answer generation, or a list of candidate generations.
Finally, there is the bottom of the muffin. What is this for? Well, this is checking the quality of your answer against prior “approved” answers. It uses embeddings and closeness to produce a confidence factor. The theory is that the input/output mapping is a continuous function, so close inputs correspond to close outputs. So you use embeddings to validate this input/output closeness relationship against prior expectations.
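One way to implement that bottom-of-the-muffin check (a sketch, assuming a hypothetical `embed()` helper and a store of previously approved question/answer pairs) is: if the new question sits close to an approved question in embedding space, its answer should sit close to that approved answer.

```python
import math

def embed(text: str) -> list[float]:
    """Return an embedding vector from whatever embedding model you use."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def confidence(question: str, answer_text: str,
               approved: list[tuple[str, str]]) -> float:
    """Close inputs should give close outputs: for every approved question
    near the new question, check that the new answer is near the approved
    answer, and report the weakest agreement as a rough confidence factor."""
    q_vec, a_vec = embed(question), embed(answer_text)
    scores = []
    for prev_q, prev_a in approved:
        if cosine(q_vec, embed(prev_q)) > 0.85:  # threshold is arbitrary; tune it
            scores.append(cosine(a_vec, embed(prev_a)))
    # No nearby approved question means there is nothing to validate against.
    return min(scores) if scores else 0.0
```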
This is Hydra-RAGamuffin!!! (said like THIS IS SPARTA!) :image of ragamuffin cat with multiple heads:
This is good advice as long as latency is not your top priority: Hydra-RAGamuffin needs fairly lax latency requirements, since every answer involves several LLM and embedding calls.