RAG and Embeddings - Is it better to embed text with labels or not?

@almosnow - I may be misunderstanding (apologies if so, please elaborate and I’ll help where I can) – but this architecture sounds backwards at first glance. One of those situations where “When you’re a hammer, everything looks like a nail.” (in this case, an LLM being the hammer).

Because I may be misunderstanding, I’ll explain how I would go about this. If I’m wrong, your explanation of where this misses should help us get to the best answer to your question.

How I’d architect this:

  1. Ingest a varied sample of “documents” (questionnaires), enough to build your classification dataset: a simple list of categories, etc.
  2. Build a prompt to convert each of the freeform questionnaires into structured data, which will be stored along with the original questionnaire text.
  3. With the data now in place, your application (the ability to search/analyze/report on the data) will hit a relational database. Or better yet, a relational DB with a vector store as well.
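
To make steps 2 and 3 concrete, here is a rough sketch, not a definitive implementation. Everything named in it is my assumption: the `questionnaire` table, the `CATEGORIES` list, and especially `extract_structured`, which is only a keyword-matching stand-in for whatever LLM prompt you end up writing in step 2. SQLite is used just because it ships with Python.

```python
import json
import sqlite3

# Hypothetical category list produced by step 1.
CATEGORIES = ["billing", "support"]


def extract_structured(text: str) -> dict:
    """Stand-in for the step 2 LLM prompt (assumption, not a real API).

    A real version would send `text` to a model and parse its structured
    reply; here a keyword match keeps the sketch runnable offline.
    """
    lowered = text.lower()
    category = next((c for c in CATEGORIES if c in lowered), "uncategorized")
    return {"category": category, "word_count": len(text.split())}


def build_db(questionnaires: list[str]) -> sqlite3.Connection:
    """Store each questionnaire's original text next to its extracted fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE questionnaire ("
        " id INTEGER PRIMARY KEY,"
        " raw_text TEXT NOT NULL,"
        " category TEXT NOT NULL,"
        " extracted_json TEXT NOT NULL)"
    )
    for text in questionnaires:
        fields = extract_structured(text)
        conn.execute(
            "INSERT INTO questionnaire (raw_text, category, extracted_json)"
            " VALUES (?, ?, ?)",
            (text, fields["category"], json.dumps(fields)),
        )
    conn.commit()
    return conn


def count_by_category(conn: sqlite3.Connection) -> dict:
    """Step 3: reporting is plain SQL, with no LLM in the query path."""
    rows = conn.execute(
        "SELECT category, COUNT(*) FROM questionnaire"
        " GROUP BY category ORDER BY category"
    ).fetchall()
    return dict(rows)
```

Once the rows exist, something like `count_by_category(build_db(texts))` answers a reporting question with an ordinary `GROUP BY`, and the original text is still there if you need to re-extract later.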

In other words, you should gain far more capability by first processing the data into something more usable for your purpose than by using an LLM in place of the RDBMS/SQL component. Even if you need an LLM to replace the “interface” (aka human-data-middleware), you’re still better off with the data being processed and the RDBMS/vector DB doing the bulk of the filtering.
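
As a sketch of what “the RDBMS/vector DB doing the bulk of the filtering” could look like: filter on the structured column with SQL first, then rank only the surviving rows by embedding similarity. The table layout, the two-dimensional toy vectors, and the `search` helper are all assumptions for illustration; a real setup would use vectors from an embedding model and probably a proper vector index rather than JSON in a text column.

```python
import json
import math
import sqlite3


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def demo_db() -> sqlite3.Connection:
    """Tiny in-memory table with made-up rows and toy embeddings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE questionnaire ("
        " id INTEGER PRIMARY KEY, raw_text TEXT, category TEXT, embedding TEXT)"
    )
    rows = [
        (1, "Charged twice this month", "billing", [1.0, 0.0]),
        (2, "Invoice address is wrong", "billing", [0.0, 1.0]),
        (3, "Cannot log in", "support", [1.0, 0.0]),
    ]
    conn.executemany(
        "INSERT INTO questionnaire VALUES (?, ?, ?, ?)",
        [(i, text, cat, json.dumps(vec)) for i, text, cat, vec in rows],
    )
    return conn


def search(conn, category: str, query_vec: list[float], top_k: int = 2):
    """SQL narrows the candidates; similarity only ranks what is left."""
    rows = conn.execute(
        "SELECT id, raw_text, embedding FROM questionnaire WHERE category = ?",
        (category,),
    ).fetchall()
    scored = [
        (cosine(query_vec, json.loads(emb)), row_id, text)
        for row_id, text, emb in rows
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(row_id, text) for _, row_id, text in scored[:top_k]]
```

The point of the ordering is that the exact-match part (category, dates, IDs) is handled by the database, so the fuzzy part never has to compete with rows that could not have matched anyway.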

Does that make any sense?
