*Background:*
I invested 30 minutes writing a comprehensive reply to this question in a manner that benefits others. While I was replying, the original topic was closed.
I believe my reply will help others (not merely the main poster). Therefore, I’ll post the following:
Quote:
@almosnow A typical query on this dataset would be something like:
“Retrieve all the products which had a positive review on the first week but a negative one on the last week”
…followed by…
@almosnow - Do you really think this is a problem that could be solved with a select * from reviews where week = ‘good’ AND month = ‘bad’?.
I’m so thankful that you’ve posted this (and it’s one of the few replies I’m still able to read), because it elucidates the issue very clearly.
You stated that using an RDBMS is a “step back” and that to you, our suggestion amounted to simple SQL.
Your understanding has no parity with what we’ve attempted to explain. Zero.
The problem is not with our attempts to help, but your interpretation / assumptions. No one suggested a simple SQL approach alone would be the answer.
As someone who has worked with multi-million record datasets…
…and sentiment analysis (long before LLMs) as well as million-record vector and RAG systems, it’s clear to me that everyone who has replied here actually has the experience (not merely the knowledge/theory, but pragmatic, firsthand experience) to understand the nuance that you’ve overlooked.
A small problem of irony –
You’ve implied you’re an expert with LLMs and RAG; or at least, you know better than many of us with expert-level experience.
Yet at the same time, you’ve suggested that your question is relatively predictable/deterministic.
While you’ve written that old-tech RDBMS are beneath you, you’re asking a question about a less structured, less predictable architecture as if it functions with the simplicity of the structured data you disparage.
Equally ironic here, is…
RAG behaves quite differently depending on the structure of the content, and therefore it’s impossible to answer this definitively without a couple of documents with full examples (the full Q&A, to understand the ratio of Q to A text, for example). An expert in RAG would appreciate this nuance.
Even if we attempted to predict the outcome based on our experiences, RAG will surprise you (both good and bad; typically more on the bad side in an instance like yours, with very fine-grained queries that are aided by some structure, which is the reason we’ve suggested a hybrid approach).
If you were to follow our suggestions of (for example) pre-processing your data to exploit ML, then combining that with the benefits of some structure (in addition to the vector store), we could provide a more predictable/definitive answer.
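To make the hybrid suggestion concrete, here is a minimal sketch, not a definitive design. It assumes sentiment labels were already produced by some ML pre-processing step, and it uses a hypothetical `reviews` schema, toy two-dimensional vectors in place of real embeddings, and made-up product IDs. The structured part answers the “positive first week, negative last week” condition exactly; the vector part only ranks what survives that filter.

```python
import math
import sqlite3

# Hypothetical rows: (product_id, week, sentiment, toy embedding).
# The sentiment labels are assumed to come from an earlier ML pre-processing step.
REVIEWS = [
    ("p1", 1, "positive", [0.9, 0.1]),
    ("p1", 4, "negative", [0.1, 0.9]),
    ("p2", 1, "positive", [0.8, 0.2]),
    ("p2", 4, "positive", [0.7, 0.3]),
    ("p3", 1, "negative", [0.2, 0.8]),
    ("p3", 4, "negative", [0.3, 0.7]),
]


def build_db(rows):
    """Load the structured columns into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reviews (product_id TEXT, week INTEGER, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO reviews VALUES (?, ?, ?)",
        [(pid, week, sentiment) for pid, week, sentiment, _ in rows],
    )
    return conn


def flipped_products(conn, first_week, last_week):
    """Products with a positive review in first_week and a negative one in last_week."""
    sql = """
        SELECT DISTINCT a.product_id
        FROM reviews a
        JOIN reviews b ON a.product_id = b.product_id
        WHERE a.week = ? AND a.sentiment = 'positive'
          AND b.week = ? AND b.sentiment = 'negative'
        ORDER BY a.product_id
    """
    return [row[0] for row in conn.execute(sql, (first_week, last_week))]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_search(rows, query_vec, first_week, last_week):
    """Filter by the structured condition, then rank the survivors by similarity."""
    conn = build_db(rows)
    candidates = set(flipped_products(conn, first_week, last_week))
    conn.close()
    scored = [
        (cosine(query_vec, vec), pid, week)
        for pid, week, _, vec in rows
        if pid in candidates
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, pid, week in hybrid_search(REVIEWS, [0.0, 1.0], 1, 4):
        print(pid, week, round(score, 3))
```

The point of the split is that the time-window logic never depends on what the retriever happens to return.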
Therefore, the really short, direct answer you seek is as follows:
1. The odds are that yes, labeled embedding (versus removing the questions) will, more often than not, be the better approach, particularly considering your retrieval use case.
2. However, you’ve said that your content (the answers) is short. Relatively long embedded text (the questions) compared with short content (the answers), in a multi-million document application, does not fit squarely within the basis for efficacy of #1.
3. Therefore, the easiest (by far) and most accurate answer is to take ~100k documents and try both. If you’re going forward with the application, this requires an inconsequential amount of time/effort/cost, and it wouldn’t make sense to move forward without the empirical data.
4. If you’d like to supply us with a couple of examples of the entire documents (both Q&A), we can give a bit more insight; however, based on what we have so far, #1 qualified by #2 is the answer you seek (whether those of us with a great deal of experience in this field agree it’s the best approach or not).
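The “try both” test can be as simple as the following sketch. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for whatever embedding model you actually use, the two sample documents and queries are invented, and recall@k is just one reasonable metric. Swap in your real embedder, your ~100k sample, and queries with known correct documents.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: bag-of-words counts. Replace with a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def labeled(doc):
    """Strategy A: embed the question and the answer, each with a label."""
    return f"Question: {doc['question']} Answer: {doc['answer']}"


def answer_only(doc):
    """Strategy B: drop the question and embed the answer alone."""
    return doc["answer"]


def recall_at_k(docs, queries, to_text, embed=toy_embed, k=1):
    """Fraction of queries whose expected document appears in the top k."""
    index = [(doc["id"], embed(to_text(doc))) for doc in docs]
    hits = 0
    for query, expected_id in queries:
        q = embed(query)
        ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
        if expected_id in [doc_id for doc_id, _ in ranked[:k]]:
            hits += 1
    return hits / len(queries)


# Invented sample data; use your own documents and labeled queries.
DOCS = [
    {"id": "d1", "question": "How was the battery life?", "answer": "Great, lasted two days."},
    {"id": "d2", "question": "How was the screen quality?", "answer": "Poor, very dim."},
]
QUERIES = [("battery life", "d1"), ("screen quality", "d2")]

if __name__ == "__main__":
    print("labeled:", recall_at_k(DOCS, QUERIES, labeled))
    print("answer only:", recall_at_k(DOCS, QUERIES, answer_only))
```

Two toy documents prove nothing about your data; the harness is the useful part, and the numbers only mean something once it runs on a representative sample.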
Simply put:
- If you don’t want to test (being experienced, I’m assuming you wouldn’t skip this step, which would render your post moot), then embed the questions.
- However, as an expert in this field, you already know the proper answer – that without more data, none of us can provide you with an answer that is valuable due to the unique nature of GL/RAG.
Sidenote:
In several instances, you’ve been dismissive of and condescending toward people’s attempts to help. I understand that perhaps you are far more advanced than the rest of us.
As you are highly advanced, perhaps you can share your expertise. I encourage you to sort through some of the questions posted here and reply to help others.
…Good luck, I wish you the best success in your RAG-based application.