RAG and Embeddings - Is it better to embed text with labels or not?

Hi all,

I’m building a large-ish RAG system for reviews (~2M documents). Each document is small (~512 tokens) and contains text that is somewhat structured, since every document has several different properties which are clearly delineated.

Example:
Each user review in my dataset comes from asking the user several questions like:

  • how did you feel about the product when it arrived?
  • how did you feel about the product after a week using it?
  • how did you feel about the product after a month using it?
  • describe the places where you’ve used the product?

The search strategy here is the usual one for RAG: embed all the documents and look for the closest ones to a given query.

So, when embedding these documents, I could choose to either label these questions explicitly in the text, or just throw the questions away and concatenate all the answers into a single block of text.

Example of labeled text for embedding:

>how did you feel about the product when it arrived?
It was nice and the packaging was good, I was excited because ...

>how did you feel about the product after a week using it?
I'm not completely satisfied as it turns out that ...

...

Example of “unlabeled” text for embedding (just the answers):

It was nice and the packaging was good, I was excited because ...

I'm not completely satisfied as it turns out that ...

And so, the question is: which of the two options would be better if I’m only concerned about improving the accuracy of the result set I get when querying this later on?
(Don’t worry about the G part of RAG; when I do the final completion step I just supply the whole document to GPT.)

Obviously the right way to go on my end is to measure this: do both and see which one performs better. But I wanted to ask here first, as I’d like to hear about your experience with this.
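For what it’s worth, here is roughly the kind of harness I have in mind for that measurement. A minimal sketch only: the embedding model name, the toy documents, and the query are placeholders, and a real test would score recall over many labeled query/document pairs.

```python
# Minimal A/B sketch: embed the same reviews with and without question labels,
# then check which variant ranks the relevant review higher for a query.
# Assumes the openai Python package (v1+) with OPENAI_API_KEY set; the model
# name and toy data are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_match(query: str, docs: list[str]) -> int:
    # OpenAI embeddings are unit-normalized, so dot product == cosine similarity
    sims = embed(docs) @ embed([query])[0]
    return int(np.argmax(sims))

labeled = [
    ">after a week\nI'm not completely satisfied...\n>after a month\nIt broke.",
    ">after a week\nGreat so far.\n>after a month\nStill love it.",
]
unlabeled = [d.replace(">after a week\n", "").replace(">after a month\n", "")
             for d in labeled]

query = "reviews that turned negative after a month"
print("labeled picks doc:", top_match(query, labeled))    # hopefully 0
print("unlabeled picks doc:", top_match(query, unlabeled))
```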

2 Likes

@almosnow - I may be misunderstanding (apologies if so, please elaborate and I’ll help where I can), but this architecture sounds backwards at first glance. One of those “when you’re a hammer, everything looks like a nail” situations (in this case, an LLM being the hammer).

Because I may be misunderstanding, I’ll explain how I would go about this. If I’m wrong, I think your explanation of why will help constructively answer your question with the best solution.

How I’d architect this:

  1. Ingest a scattered sample of enough “documents” (questionnaires) to build your classification dataset: a simple list of categories, etc.
  2. Build a prompt to convert each of the freeform questionnaires into structured data, which will be stored along with the original questionnaire text (see the sketch below).
  3. With the data now in place, your application’s search/analyze/report features will hit a relational database. Or better yet, a relational DB with a vector store as well.

In other words, you should be able to gain far more capability by first processing the data into something more usable for your purpose than by using an LLM in place of the RDBMS/SQL component. Even if you need an LLM as the “interface” (aka human-data middleware), you’re still better off with the data pre-processed and the RDBMS/vector DB doing the bulk of the filtering.
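To make step 2 concrete, something like the sketch below is what I mean. The model name, the JSON schema, and the field names are all assumptions you’d adapt; the point is that one prompt pass turns each freeform questionnaire into queryable columns.

```python
# Sketch of step 2: convert one freeform questionnaire into structured fields.
# Assumes the openai v1+ client; model, schema, and field names are illustrative.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """Extract the following from the review and reply as JSON:
- arrival_sentiment, week_sentiment, month_sentiment: one of positive/negative/neutral
- usage_places: list of strings"""

def structure_review(review_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any JSON-capable chat model would do
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": review_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# The resulting dict is stored in the RDBMS alongside the original text,
# so later queries hit indexed columns instead of an LLM.
```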

Does that make any sense?

4 Likes

Reading the examples of the questions and answers you provided, my observation is that some of the context, and hence the meaning, of the responses is lost. With the question included, it is clear that the customer was excited when the product arrived and then, after a week, was less than satisfied. Without the question, that context is lost.

So, I would advise clearly labelling question and response and any other context you have before embedding.
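Building that labeled text is then just a deterministic join over the question/answer pairs, something like this sketch (the “>” formatting is only one convention):

```python
# Build the labeled text that gets embedded: one ">question\nanswer" block per pair.
def to_labeled_text(qa_pairs: list[tuple[str, str]]) -> str:
    return "\n\n".join(f">{q}\n{a}" for q, a in qa_pairs)

print(to_labeled_text([
    ("how did you feel about the product when it arrived?",
     "It was nice and the packaging was good..."),
]))
```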

@DevGirl read a bit about RAG, you’re completely lost

@dlaytonj2 thanks, but,

if I’m only concerned about improving the accuracy of the result set I get when querying this later on

Not knowing anything about what you want to do, I would generally agree with @dlaytonj2 that labeling your embeddings will give you the better responses.

However, and perhaps I just can’t read it correctly, while I see the data you are hoping to search, I do not see the questions you intend to ask. Now, I can assume that if these are reviews tied to specific products or services, then you are expecting questions about these products and services from potential customers.

Also, don’t be too harsh on @DevGirl

It is entirely possible that your goals could be achieved with a traditional SQL approach.

2 Likes

It depends on what kind of result set you are looking for and what type of query you are doing; you don’t say. If you are doing RAG, then I assume you want to do some kind of semantic query, perhaps separating positive feedback from negative based on the customer comments. If that is all you are looking for, then answers alone will be fine. However, if you are looking only for feedback pertaining to the product after a week, that context is lost when you don’t include the question.

I would add that @DevGirl is also right to suggest that, in the absence of knowing what kind of queries you want to run, a relational database might be the answer. I am assuming you primarily want to do semantic searches on customer feedback, and that is why you are using a RAG approach.

2 Likes

More information is better …

So the question is, what do you hope to retrieve after the correlation?

I can see just having 4+ RAG shards, one for each question, to keep all information under a question contained. So you would get 4+ queries from your 4+ questions.

Then you have 4+ groupings (4 x Top_K things, siloed under each question) of related things to present to the LLM prompt, along with the understanding of which question each piece of information spawned from, for the LLM to see.

But what would you do with this? Especially since each query for each question could come from unrelated products?

If you need to lock down on one product, then you need further sharding: with P products, Q questions, and K retrievals per question, that’s P*Q shards instead of Q.
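A minimal sketch of that sharding, assuming embeddings are computed elsewhere and unit-normalized (the shard keys and helper names are illustrative):

```python
# Per-question shards: each question keeps its own (vector, doc_id) index,
# so a query against "after a week" never mixes with "after a month" answers.
import numpy as np

shards: dict[str, list[tuple[np.ndarray, str]]] = {}  # question -> [(embedding, review_id)]

def add(question: str, embedding: np.ndarray, review_id: str) -> None:
    shards.setdefault(question, []).append((embedding, review_id))

def top_k(question: str, query_vec: np.ndarray, k: int = 5) -> list[str]:
    entries = shards.get(question, [])
    # dot product ~ cosine similarity for unit-normalized embeddings
    scored = sorted(entries, key=lambda e: -float(e[0] @ query_vec))
    return [review_id for _, review_id in scored[:k]]

# For product-level lockdown, key the dict by (product_id, question) instead:
# P products x Q questions = P*Q shards.
```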

Then still, are you just doing search with embeddings? That’s OK … but then the “G” part of RAG is not clear here.

4 Likes

@DevGirl read a bit about RAG, you’re completely lost

I run several RAG based systems.

As I explained, perhaps I’m misunderstanding the problem, but it sounded like you wanted to use a RAG-based LLM in place of a query mechanism that would benefit from a more structured approach, whether that is achieved by LLM pre-processing or a hybrid approach.

That’s precisely why I asked if you could explain why my thinking was wrong, because it would help me understand what I was missing. I’m referring to my comment:

@DevGirl - I’ll explain how I would go about this. If I’m wrong, I think your explanation of why will help constructively answer your question with the best solution.

2 Likes

Thanks @curt.kennedy,

But what would you do with this? Especially since each query for each question could come from unrelated products?

A typical query on this dataset would be something like:

“Retrieve all the products which had a positive review on the first week but a negative one on the last week”

Then still, are you just doing search with embeddings? That’s OK … but then the “G” part of RAG is not clear here.

Yes, I’m only concerned with the search/retrieval part of RAG; the generative part is solved quite well by providing GPT with the context and the question.

So to handle this query, you need a few more things.

You need an additional sentiment field attached to each question, say Positive, Negative, or Neutral. This is a database field that you would retrieve through normal database queries, not the LLM.

You probably also want a timestamp field, some sort of UNIX timestamp that you can threshold to run periodic reports, forcing it to look at new data instead of rehashing past data all the time.

So, using the sentiment field and the timestamp field, you would first query the database for these items, then get the set of products. There is no LLM required for this.
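As a sketch, your “positive on the first week, negative later” query then becomes plain SQL. The schema here is assumed, not prescribed:

```python
# Plain-database version of "positive at first, negative later"; no LLM involved.
# Assumes a reviews table with per-question sentiment columns and a UNIX timestamp.
import sqlite3, time

conn = sqlite3.connect("reviews.db")
one_month_ago = int(time.time()) - 30 * 24 * 3600

rows = conn.execute(
    """
    SELECT DISTINCT product_id
    FROM reviews
    WHERE week_sentiment = 'positive'
      AND month_sentiment = 'negative'
      AND created_at >= ?        -- only look at fresh data
    """,
    (one_month_ago,),
).fetchall()
print([r[0] for r in rows])
```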

2 Likes

If you have a number of products with the reviews attached to them, a Graph Database could be a nice fit here.

You can attach embeddings, run an LLM over the reviews to create metadata such as sentiment, and then identify weak/strong points really easily. I also don’t see why, if you have a preset list of questions, you wouldn’t just store each one as a field.

It seems to me necessary to embed both the question/label and the answer together.

The data visualization is a lot of fun as well.
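As a rough illustration of the shape, using networkx as a stand-in for a real graph database (node names and attributes are made up):

```python
# Toy graph: products and reviews as nodes, with sentiment stored as metadata.
# A real graph DB (e.g. Neo4j) would replace this, but the shape is the same.
import networkx as nx

G = nx.Graph()
G.add_node("prod_42", kind="product", name="Widget")
G.add_node("rev_1", kind="review",
           week_sentiment="positive", month_sentiment="negative",
           embedding=None)  # attach the vector here if desired
G.add_edge("rev_1", "prod_42", relation="REVIEWS")

# Find products with a "good at first, bad later" review attached.
flagged = {
    nbr
    for node, data in G.nodes(data=True)
    if data.get("kind") == "review"
    and data.get("week_sentiment") == "positive"
    and data.get("month_sentiment") == "negative"
    for nbr in G.neighbors(node)
}
print(flagged)  # {'prod_42'}
```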

3 Likes

To answer your question: the first one will be better in terms of accuracy, as the embeddings will have more context.

You are right, that is actually the right way: you might then find that neither option returns acceptable accuracy for the use-case query you want to run.

Since you wanted to hear about others’ experience: @DevGirl mentioned a potentially better way to get higher accuracy than a plain vector similarity search, which is to use an RDBMS to filter things out and add a vector similarity search on top of it.

TL;DR

The simple answer to your question, with some assumptions, is: “Labeled Text”.
The relevant solution to your use case is: “Use RDBMS + vector similarity search (Optional)”.

1 Like

Thanks @RonaldGRuckus, and yes, visualizing the whole thing helps a lot :smiley:,

I also don’t see why, if you have a preset list of questions, you wouldn’t just store each one as a field.

Because there is no such thing.

@curt.kennedy you had me until the very end:

There is no LLM required for this.

and others,

“Use RDBMS + vector similarity search (Optional)”

Do you really think this is a problem that could be solved with a select * from reviews where week = 'good' AND month = 'bad'? Do you think I mixed up the postgresql mailing list with the OpenAI community? Fitting this problem into an RDBMS would be taking a step backwards in solving the issue; thanks for negating, literally, the whole f* point of using an LLM to deal with unstructured data :joy:.

My question is extremely specific, and it also has a very specific answer which is pretty much self-contained in the title: in the context of RAG (and OpenAI embeddings), do you get more accuracy if you embed text with labels or not?

inb4 “it depends”

No, it doesn’t! Grab any, literally any, organic or synthetic dataset, embed it with/without text labels, then query it with/without labels and measure the retrieval accuracy. That’s it. I can do that in an afternoon, but I wanted to hear from people who (I thought) had faced this problem. “But muh SQL”, what a joke.

In the context of your text: Yes, 100%.

A question like this cannot be answered by LLM retrieval alone (it requires logic). But if you have a relational database, you can use an LLM to organize/structure the unstructured text and THEN perform queries on the metadata afterwards.

2 Likes

A question like this cannot be answered by LLM retrieval alone (it requires logic).

Here you go, an “impossible” feat (and it’s not even GPT4):

[screenshot of a gpt-3.5-turbo chat answering the query]

Right. I’m on the same page.

Using an LLM to “crawl” through all of your “unstructured” documents to structure them would mean that you can afterwards run queries across all ~2M documents without spending a lot of money.

1 Like

My point is that the logic that supposedly requires the RDBMS layer can be introduced trivially in the G part of RAG.
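i.e. once retrieval returns candidates, the filtering logic can simply be asked of the model. A minimal sketch (the model name and prompt wording are my own placeholders):

```python
# The "logic" lives in the G step: retrieved labeled reviews go into the prompt
# and GPT applies the week-good/month-bad criterion itself.
from openai import OpenAI

client = OpenAI()

def filter_with_gpt(retrieved_docs: list[str], criterion: str) -> str:
    context = "\n\n---\n\n".join(retrieved_docs)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[
            {"role": "system",
             "content": "Given the labeled reviews below, list only those matching the criterion."},
            {"role": "user", "content": f"Criterion: {criterion}\n\nReviews:\n{context}"},
        ],
    )
    return resp.choices[0].message.content

# filter_with_gpt(top_docs, "positive after the first week but negative after a month")
```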

But getting the right document set to supply as context is the meat of the issue.

An ideal embedding algorithm would cluster all documents like:

>after first week
good
>after first month
good

together and a bit apart from other documents like:

>after first week
good
>after first month
bad

And how you set up your embedding texts/queries will definitely bring you closer to or further from that goal, which includes if and how you choose to label things.
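And that claim is cheap to sanity-check directly: embed the variants and compare similarities. A quick sketch (model name and toy texts are placeholders):

```python
# Quick check of the clustering claim: a good/good doc should sit closer to
# another good/good doc than to a good/bad one.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    ">after first week\ngood\n>after first month\ngood",    # A
    ">after first week\ngreat\n>after first month\ngreat",  # B, same pattern as A
    ">after first week\ngood\n>after first month\nbad",     # C, diverging pattern
]
resp = client.embeddings.create(model="text-embedding-3-small", input=docs)
a, b, c = (np.array(d.embedding) for d in resp.data)

# embeddings are unit-normalized, so dot product == cosine similarity
print("sim(A, B):", float(a @ b))  # hopefully higher
print("sim(A, C):", float(a @ c))  # hopefully lower
```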

Nope, no LLM.

Sentiment classifiers can be built with other, non-LLM models (much cheaper, BTW), and embeddings would probably even be a good fit for this problem: your own custom embedding classifier could give you more than simple sentiment.
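For the custom embedding classifier, a minimal sketch would be nearest-centroid classification over a handful of hand-labeled seed examples (the labels and examples here are made up):

```python
# Nearest-centroid sentiment classifier built on embeddings alone; no chat LLM
# call at classification time, just the (cheap) embedding itself.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# A few hand-labeled seed examples per class (illustrative).
seeds = {
    "positive": ["Love it, works great.", "Exceeded my expectations."],
    "negative": ["It broke after a week.", "Very disappointed."],
}
centroids = {label: embed(texts).mean(axis=0) for label, texts in seeds.items()}

def classify(text: str) -> str:
    v = embed([text])[0]
    return max(centroids, key=lambda label: float(v @ centroids[label]))

print(classify("Not completely satisfied, it turns out..."))  # likely 'negative'
```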

The DB logic and timestamps don’t have an LLM either.

Summarizing stats via histograms of the products that satisfy this “good at first, then bad” criterion also doesn’t require an LLM.

Nothing you have said requires an LLM.

And with the volume of data (2 million reviews), I would dodge using an LLM unless you really need it, just due to cost alone.

2 Likes

Good to know that you already have the answer to your question.

One reason the posts of literally everyone else don’t align with your expectations is that the example you gave and the chat you posted are different. There is no mention of the >product question in your examples, but your chat with gpt-3.5-turbo has that extra component.

Feel free to correct me if I missed the mention of >product question in the chat.

3 Likes

Not seeing that. A fine-tuned version of Babbage, or even labeled embeddings, would give you much more control and precision in sentiment and classification than GPT-4.

Use GPT-4 all you want. But it doesn’t fit the problem you are describing.

4 Likes