RAG and Embeddings - When to embed context (Follow-Up)

*Background:*

I invested 30 minutes writing a comprehensive reply to this question in a manner that benefits others. While I was replying, the original topic was closed.

I believe my reply will help others (not merely the original poster). Therefore, I’ll post it here:

Quote:

@almosnow A typical query on this dataset would be something like:

“Retrieve all the products which had a positive review on the first week but a negative one on the last week”

…followed by…

@almosnow - Do you really think this is a problem that could be solved with a select * from reviews where week = ‘good’ AND month = ‘bad’?.

I’m so thankful that you’ve posted this (and it’s one of the few replies I’m still able to read), because it elucidates the issue very clearly.

You stated that using an RDBMS is a “step back” and that to you, our suggestion amounted to simple SQL.

Your understanding bears no relation to what we’ve attempted to explain. None.

The problem is not with our attempts to help, but your interpretation / assumptions. No one suggested a simple SQL approach alone would be the answer.

As someone who has worked with multi-million-record datasets…

…and sentiment analysis (long before LLMs), as well as million-record vector and RAG systems, it’s clear to me that everyone who has replied here actually has the experience (not merely the knowledge/theory, but pragmatic, firsthand experience) to understand the nuance that you’ve overlooked.

A small problem of irony –

You’ve implied you’re an expert with LLMs and RAG; or at least, that you know better than many of us with expert-level experience.

Yet at the same time, you’ve suggested that your question is relatively predictable/deterministic.

While you’ve written that old-tech RDBMS are beneath you, you’re asking a question about a less structured, less predictable architecture as if it functions with the simplicity of the structured data you disparage.

Equally ironic here, is…

RAG behaves quite differently depending on the structure of the content; therefore, it’s impossible to answer this definitively without a couple of documents with full examples (the full Q&A, to understand the ratio of Q text to A text, for example). An expert in RAG would appreciate this nuance.

Even if we attempted to predict the outcome based on our experience, RAG will surprise you (both good and bad; typically more on the bad side in a case like yours, with very fine-grained queries that are aided by some structure, which is why we’ve suggested a hybrid approach).

If you were to follow our suggestions (for example, pre-processing your data to exploit ML, then combining that with the benefits of some structure in addition to the vector store), we could provide a more predictable/definitive answer.

Therefore, the really short, direct answer you seek is as follows:

  1. The odds are that yes, labeled embedding (versus removing the questions) will, more often than not, be the better approach – particularly considering your retrieval use case.
  2. However, you’ve suggested that your content (answers) is short; in a multi-million-document application, relatively long embedded questions versus short answer content does not fit squarely within the basis for the efficacy of #1;
  3. Therefore, the easiest (by far) and most accurate answer is to take ~100k documents and try both. If you’re going forward with the application, this requires an inconsequential amount of time/effort/cost and it wouldn’t make sense to move forward without the empirical data.
  4. If you’d like to supply us with a couple of examples of the entire documents (both Q&A), we can give a bit more insight; however, based on what we have so far, #1 qualified by #2 is the answer you seek (whether or not those of us with a great deal of experience in this field agree it’s the best approach).
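For readers wanting to act on point 3, here is a minimal sketch of such an A/B test: embed each document two ways (question + answer concatenated versus answer-only) and compare top-1 retrieval hit rate on held-out queries. The bag-of-words cosine below is a toy stand-in for a real embedding model, and the corpus and queries are invented; swap in your own embedder and a ~100k sample.

```python
# A/B harness sketch: compare "Q+A" embeds vs. answer-only embeds.
# embed() is a toy stand-in for a real embedding model.
import math
from collections import Counter

def embed(text):
    """Toy embedding: lower-cased bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hit_rate_at_1(corpus, queries):
    """Fraction of queries whose top-1 document is the expected one."""
    vecs = [embed(doc) for doc in corpus]
    hits = 0
    for query, expected_idx in queries:
        qv = embed(query)
        best = max(range(len(vecs)), key=lambda i: cosine(qv, vecs[i]))
        if best == expected_idx:
            hits += 1
    return hits / len(queries)

# Invented toy data standing in for the real multi-million-doc corpus.
qa_pairs = [
    ("How do I reset my password?", "Use the account settings page."),
    ("What is the refund window?", "Refunds are accepted within 30 days."),
    ("Which regions do you ship to?", "We ship to the US and EU only."),
]
queries = [
    ("password reset steps", 0),
    ("refund policy timing", 1),
    ("shipping regions", 2),
]

variant_qa = [q + " " + a for q, a in qa_pairs]   # questions kept
variant_a_only = [a for _, a in qa_pairs]          # questions removed

print("Q+A embeds  :", hit_rate_at_1(variant_qa, queries))
print("answer-only :", hit_rate_at_1(variant_a_only, queries))
```

On this toy data the Q+A variant wins because the queries echo question vocabulary; your real ratio of question text to answer text is exactly the variable point 2 flags, which is why the empirical run matters.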

Simply put:

  • If you don’t want to test (being experienced, I’m assuming you wouldn’t skip this step, which would render your post moot), then: embed the questions.
  • However, as an expert in this field, you already know the proper answer: without more data, none of us can provide you with an answer that is valuable, due to the unique, case-by-case nature of RAG.

Sidenote:

In several instances, you’ve been dismissive and condescending toward people’s attempts to help. I understand that perhaps you are far more advanced than the rest of us.

As you are highly advanced, perhaps you can share your expertise. I encourage you to sort through some of the questions posted here and reply to help others.

…Good luck, I wish you the best success in your RAG based application.


Sure! Here are my comments.

The odds are that yes, labeled embedding (versus removing the questions) will, more often than not, be the better approach – particularly considering your retrieval use case.

Yeah, I know that today (Sunday) after doing some tests with synthetic data. That was the kind of answer I was looking for yesterday (Saturday), in anticipation of what I needed to do to find an answer on my own.

However, you’ve suggested that your content (answers) is short; in a multi-million-document application, relatively long embedded questions versus short answer content does not fit squarely within the basis for the efficacy of #1.

Just to clarify, yes, the documents here (content) are relatively small (~512 tokens, max is ~1024) but still much larger than the queries, which are quite short and concise. So, this hasn’t been an issue.

Therefore, the easiest (by far) and most accurate answer is to take ~100k documents and try both. If you’re going forward with the application, this requires an inconsequential amount of time/effort/cost and it wouldn’t make sense to move forward without the empirical data.

Yes, I did that and I found that labeling the data doesn’t really improve retrieval accuracy that much (will update when I find out why).

If you’d like to supply us with a couple of examples of the entire documents (both Q&A), we can give a bit more insight

The “reviews” thing was a toy example that closely relates to (behaves like) my use case. My actual use case will be kept out of this conversation since I’m under an NDA. But it wouldn’t be hard for me to compile a test dataset and share it here (the Amazon reviews one, or whatever).


(misc stuff)

rendering your post moot

Even if I already knew the answer (which I didn’t), asking it here doesn’t break any rules.

While you’ve written that old-tech RDBMS are beneath you

I never wrote this, at all. RDBMSs are great at solving the specific problems they were designed for. In the context of my question, RDBMSs are irrelevant; at its core, it’s almost a purely mathematical question.

…Good luck, I wish you the best success in your RAG based application.

Thanks, it’s already working amazingly well :smiley:.

Last, but not least, I have no idea why this comment was flagged on the other thread; as you can see, there’s nothing wrong with it. I’d also like to ask again: can I take a look at any of the systems you’ve built? (Any of them in a public-facing product, perhaps?) I’m sure that would be of much relevance to this discussion :).

I’ll refrain from engaging in any of the comments where I believe you’ve misunderstood me, because this won’t help others.

If you ever need to provide examples, the best route is to take the most conventional/standard examples you have, feed them to an LLM, and ask it to generate an analog. In your prompt, specify the topic and meta details so that the generation focuses solely on the most analogous, direct parallel.
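One way to put that tip into practice is to wrap a redacted sample document in a prompt asking for a structural analog. The prompt wording and the `topic`/`meta` fields below are illustrative assumptions, not a prescribed template; send the resulting string with whatever LLM client you already use.

```python
# Sketch: build an "analog document" prompt from a sample document.
# The prompt template and field names here are assumptions.
def analog_prompt(sample_doc, topic, meta):
    return (
        f"Topic: {topic}\n"
        f"Constraints: {meta}\n"
        "Below is an example document. Generate a new document on an\n"
        "unrelated subject that mirrors its structure, length, and\n"
        "Q-to-A text ratio exactly, so it can be shared without\n"
        "exposing the original content.\n\n"
        f"---\n{sample_doc}\n---"
    )

prompt = analog_prompt(
    sample_doc="Q: Was the battery life good?\nA: Yes, about two days.",
    topic="consumer product reviews",
    meta="short answers (~512 tokens, max ~1024), concise questions",
)
print(prompt)
```

Because the analog preserves structure and length ratios rather than content, it sidesteps NDA concerns while still letting others reason about chunking and embedding trade-offs.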

As an expert in RAG, you understand that your question requires full example documents because you’re dealing with a highly variable, finicky approach.

I look forward to your follow-up, along with your replies to help other people in the forum.

Beyond that, please respect the preference of the admins and consider this closed. You have your answer and as you’ve stated, you’ve tested directly.

Note to help future readers (OP – please ignore)

Those who are new to RAG often overestimate its consistency for quantitative questions over structured data. Such queries are far better served by playing to the strengths of different technologies, just as diffusion/transformer models are not suited to interpreting seemingly pedestrian mathematical instructions. Therefore, an augmented approach is best. If you’re new to RAG in sentiment-analysis use cases, know that it’s rarely the best (or sole) approach without sentiment/categorical pre-processing.
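The augmented approach above can be sketched in a few lines: pre-compute a sentiment label per review (here a toy keyword classifier stands in for a real sentiment model), store it alongside structured fields, and the quantitative part of a query becomes a plain filter, leaving a vector store to handle only the free-text part. The review data and keyword lists are invented for illustration.

```python
# Sketch of sentiment pre-processing feeding a structured filter.
# The keyword classifier is a toy stand-in for a real sentiment model.
POSITIVE = {"great", "excellent", "love", "good"}
NEGATIVE = {"broken", "terrible", "bad", "refund"}

def sentiment(text):
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = [
    {"product": "A", "week": 1, "text": "great battery, love it"},
    {"product": "A", "week": 4, "text": "broken after a month, terrible"},
    {"product": "B", "week": 1, "text": "bad packaging"},
    {"product": "B", "week": 4, "text": "excellent value"},
]
for r in reviews:
    r["sentiment"] = sentiment(r["text"])  # the pre-processing step

# "Positive review in the first week but negative in the last week"
# is now a structured filter over pre-computed labels, not a RAG query:
first, last = 1, 4
flagged = {
    r["product"] for r in reviews
    if r["week"] == first and r["sentiment"] == "positive"
} & {
    r["product"] for r in reviews
    if r["week"] == last and r["sentiment"] == "negative"
}
print(sorted(flagged))
```

This is the nuance behind the hybrid suggestion earlier in the thread: the structured layer answers the deterministic part of the question reliably, which pure embedding retrieval does inconsistently.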

Perhaps most importantly: For the most accurate results – refer to the quick/easy tip above on how to provide helpful example text, without violating any NDA or general privacy concerns.


This forum is not meant to cater to the “preference of the admins”, unfortunately. I will post until I feel satisfied with my interaction with this site. If mods keep closing that, that’s abuse. If they get away with it, it’s still abuse.

You have your answer and as you’ve stated, you’ve tested directly.

I’m not satisfied yet, I’ve found some quite interesting stuff on my own and would like to share it/get feedback, so I may even ask again if I feel like it :D.

Btw, I would still like to inquire about the RAG systems you’ve put in place, in the spirit of learning, as you’ve mentioned several times, I would definitely want to know more about your experience there.

You, my friend, murder irony.

You haven’t provided enough information on your application for us to adequately help.

Yet in the same breath, you feel entitled to ask others to share their work.

RAG (and sentiment analysis/classification) is most commonly used on the back end, and that’s the only use case for which I’ve ever implemented it.

I do not have anything I can publish for you, nor, I imagine, will anyone else posting here, unless someone has open-sourced (or contributes to) open-source efforts relating to some sort of democratized RAG for the benefit of others.

Great, good luck to you in those future requests.

I don’t owe you anything, though.

You also don’t owe me anything, but since you’ve alluded to some sort of concept of helping others, I wanted to see if your words meant something or were just wishy-washy. Actions speak louder than words.

nor will anyone else that I can imagine posting here

You’re only a user on this forum; chill. I can assure you plenty of people would gladly share their knowledge. Terrible attitude for a forum of this nature :roll_eyes:.