Hey joyasree,
100% agree. A VDB alone is not enough. In their vanilla implementations, they really only support semantic search and some basic similarity checks, but I’m in your camp. Here’s how I’m doing it.
I need to pull in several categories of data: legal text, a different type of legal text on a different but related subject, and standard operating procedure manuals. Sometimes I need examples from all three types of databases. But it’s kind of a pain in the ass to query and select n examples from each database every single time.
So I have a metadatabase. It’s a vector database holding a summary (pre-summarized by GPT) of every database underneath it, from every category. I have about 40 databases right now, and I’m going to expand to a lot more once the two other categories are added. My bot obviously can’t take samples from every single one, so it pulls from the metadatabase, selects which databases are most relevant, and then queries that subset for the actual content.
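If it helps, here’s a rough sketch of that two-stage lookup. Everything here is hypothetical stand-in code: the embeddings are tiny hand-made vectors and the “databases” are plain dicts, just to show the routing idea (search summaries first, then query only the winners) in a self-contained way.

```python
# Stage 1: search a "metadatabase" of per-database summary vectors.
# Stage 2: query only the top-k underlying databases for real content.
import math

def cosine(a, b):
    """Plain cosine similarity; a real setup would use the VDB's own search."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Metadatabase: one summary embedding per underlying database (made-up names).
meta_db = {
    "tax_law":    [0.9, 0.1, 0.0],
    "trade_law":  [0.7, 0.3, 0.1],
    "sop_manual": [0.0, 0.2, 0.9],
}

# Stand-ins for the actual vector databases underneath.
sub_dbs = {
    "tax_law":    ["Section 4(a): rates...", "Section 4(b): exemptions..."],
    "trade_law":  ["Article 12: tariffs..."],
    "sop_manual": ["Step 1: log the shipment..."],
}

def route_query(query_vec, k=2):
    """Rank databases by summary similarity and keep the top k."""
    ranked = sorted(meta_db, key=lambda name: cosine(query_vec, meta_db[name]),
                    reverse=True)
    return ranked[:k]

def retrieve(query_vec, k=2):
    """Pull content only from the databases the metadatabase selected."""
    results = []
    for name in route_query(query_vec, k):
        results.extend(sub_dbs[name])
    return results

print(route_query([1.0, 0.2, 0.0]))  # → ['tax_law', 'trade_law']
```

The payoff is that adding database number 41 only means adding one summary row to the metadatabase; the query path never has to fan out to all 40+.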
I’m getting really good results so far for my use case, and based on talks I’ve been having with clients and partners, I suspect I’ll be doing this on a much larger scale. I’m doing one more thing, but I don’t know if it’s only useful in my case because of the legal text.
Because of the format of the legal text (sections, subsections, subsubsections, etc.), there are a ton of small entries, and sometimes the LLM pulls in content with missing context. For example, if I pull in a subsubsection and don’t know what the parent section is referring to, I’ll have problems, because the whole entry will be something like “and anything over 14 kg”.
What I’m doing is building a simple (almost linear, to be honest) directed graph that pulls connections N deep: the text behind and in front of the current entry, n nodes deep. I’ll get more sophisticated eventually, but for now, each entry’s metadata holds a list of IDs for every “in” and “out” entry connected to it. I simply fetch the entry, query the VDB by those connection IDs, and pull in the neighbor entries. I can go N deep by querying their connections as well.
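That neighbor walk is basically a small breadth-first expansion over the in/out ID lists. Here’s a minimal sketch, with a dict standing in for the by-ID lookups you’d actually make against the VDB (the entry IDs and texts are invented for illustration):

```python
# Each entry's metadata carries the IDs of its "in" (preceding) and
# "out" (following) neighbors. Walking those edges n hops recovers the
# surrounding context for tiny entries like "and anything over 14 kg".
entries = {
    "s4":   {"text": "Section 4: Weight limits.", "in": [],       "out": ["s4a"]},
    "s4a":  {"text": "(a) Standard parcels...",   "in": ["s4"],   "out": ["s4a1"]},
    "s4a1": {"text": "and anything over 14 kg",   "in": ["s4a"],  "out": ["s4b"]},
    "s4b":  {"text": "(b) Exemptions...",         "in": ["s4a1"], "out": []},
}

def pull_context(entry_id, n):
    """Collect entry_id plus every neighbor reachable within n hops."""
    seen = {entry_id}
    frontier = [entry_id]
    for _ in range(n):
        nxt = []
        for eid in frontier:
            for neighbor in entries[eid]["in"] + entries[eid]["out"]:
                if neighbor not in seen:  # don't re-fetch entries we have
                    seen.add(neighbor)
                    nxt.append(neighbor)
        frontier = nxt
    return [entries[eid]["text"] for eid in sorted(seen)]

print(pull_context("s4a1", 1))
# → ['(a) Standard parcels...', 'and anything over 14 kg', '(b) Exemptions...']
```

Since the graph is nearly linear (document order), each hop only adds a couple of entries, so going N deep stays cheap.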
I want to expand on that capability soon, maybe even to where the AI is making the edges rather than just order of entry, but for right now it is also really helping. I think the real magic will be in how the AI chooses to use that connected data.
In terms of checking the original data sources, my metadata structure has a source key (the file path) for every single entry, and I strip out punctuation and whitespace and hash what’s left to check the content. I’m moving away from that and toward Levenshtein distances and ZK lookups, though.
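The normalize-then-hash check could look something like this. It’s a sketch under my own assumptions about the normalization (drop everything that isn’t a word character, lowercase, then SHA-256), so the exact rules may differ from what you’d want:

```python
# Verify an entry against its source by hashing a normalized form of
# the text: punctuation and whitespace are stripped first, so harmless
# reformatting of the source file doesn't trip the check.
import hashlib
import re

def content_hash(text):
    """Hash after removing punctuation/whitespace and lowercasing."""
    normalized = re.sub(r"[^\w]", "", text).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hash stored in the entry's metadata next to the source file path:
stored = content_hash("Section 4(a): and anything over 14 kg.")

# The same content, re-spaced and re-punctuated, still matches:
assert content_hash("section 4 a  and anything over 14kg") == stored
```

The limitation is exactly why you’d move on from it: a single changed character produces a completely different hash, so it tells you *that* the content drifted but nothing about *how much*, which is where an edit distance like Levenshtein earns its keep.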
Anyways, hope that helps.