I have a question and thought I'd ask the smart people here. How are you managing the relationship between the unstructured data (in your embedding store) and what you already have in your data warehouse? If we aren't relating them, I think we're just creating data silos again.
It’s the AI that does the managing; that’s its superpower. Turning unstructured data into structured data is one of the major commercial applications, and the fact that you need minimal preprocessing gives it wide applicability.
I thought I'd give an example below, so it's a little clearer what I'm saying.
Let's say I have a Q&A over my product data. I chunked the product descriptions and put them in an embedding store, but did not do a good job of mining the metadata.
So, when a user asks about a product, I may be able to retrieve the right product description, but since I haven't tied it to my structured data (and insights) through the metadata, I will fail to tell the user that this product is very popular, that there is currently a discount on it, or that there are related products that would go well with it.
What I feel is that we will be building data and knowledge silos if we do not integrate the embedding stores with the bigger data ecosystem.
You could just store the database keys in the metadata for each product in the embedding data store.
So when you find the relevant description/product via embeddings, you then query the warehouse by that key to pull in the structured data, and present both that metadata and the product description in the retrieval context prior to LLM inference.
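A minimal sketch of that join, using plain dicts in place of a real vector store and warehouse (all names here, like `product_id` and the SKU values, are hypothetical):

```python
# Each chunk in the embedding store carries the warehouse primary key in its
# metadata, so a retrieval hit can be joined back to structured data
# before the LLM call.

# What the embedding store might return for a query (metadata carries the key)
retrieved_chunk = {
    "text": "A lightweight waterproof hiking jacket ...",
    "metadata": {"product_id": "SKU-1042"},
}

# Stand-in for a warehouse lookup (in practice: a SQL query against your DW)
warehouse = {
    "SKU-1042": {
        "popularity_rank": 3,
        "discount_pct": 15,
        "related": ["SKU-2001", "SKU-2002"],
    },
}

def build_context(chunk, warehouse):
    """Join the retrieved description with its structured warehouse row."""
    row = warehouse[chunk["metadata"]["product_id"]]
    return (
        f"Description: {chunk['text']}\n"
        f"Popularity rank: {row['popularity_rank']}\n"
        f"Current discount: {row['discount_pct']}%\n"
        f"Related products: {', '.join(row['related'])}"
    )

context = build_context(retrieved_chunk, warehouse)
```

The resulting `context` string is what you'd hand to the LLM, so the model can mention the discount and related products alongside the description.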
Yes, this is the approach I use. I bring back citation links to the documents the AI has used to render its answer. It seems to me you could take a few avenues with this approach:
- Simply provide the links to the expanded documentation about your products.
- Display the relevant additional information about the linked products in the response. That is, the AI answers, “blah, blah”, and instead of simple links, you add snippets from the product documents to the response. Or, if your structured data has fields with this info, you can bring back (using an SQL function) the relevant fields for each product listed.
- The more expensive route would be to make a second API call with a prompt that reads the returned documents and adds a summary of the additional information to the response.
- Update your unstructured product description embeddings with the relevant product metadata that you want mentioned whenever documentation for that product is used in a response. You can modify your prompt to tell the AI to add any information in “current discounts”, “popularity”, “associated products”, etc. This requires keeping your “unstructured” data updated with relevant metadata, but this way you can be sure the AI will always mention the additional info. Again, I do something similar where I augment my base content with additional titled metadata that I can refer to in my prompts. I use the Weaviate vector store, but I know Pinecone supports this as well.
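The last option above can be sketched as follows: append titled metadata sections to the text before embedding, so a prompt can refer to them by title (the section titles and values here are hypothetical, and the exact format is up to you):

```python
def augment_chunk(description, metadata):
    """Append titled metadata sections to the text that gets embedded,
    so the prompt can say e.g. 'mention anything under Current Discounts'."""
    sections = [description]
    for title, value in metadata.items():
        sections.append(f"## {title}\n{value}")
    return "\n\n".join(sections)

chunk = augment_chunk(
    "A lightweight waterproof hiking jacket.",
    {
        "Current Discounts": "15% off through June",
        "Popularity": "Top-3 seller in outdoor gear",
        "Associated Products": "rain pants, trekking poles",
    },
)
```

Whenever the product metadata changes in the warehouse, you'd rebuild and re-embed the affected chunks, which is the maintenance cost this approach trades for guaranteed mentions.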
100% agree. A VDB alone is not enough. In their vanilla implementations, vector databases really only support semantic search and some basic similarity checks, but I’m in your camp. I’ll tell you how I am doing it.
I need to pull in several categories of data: legal text, a different type of legal text on a related subject, and standard operating procedure manuals. Sometimes I need examples from all three types of databases, but it is kind of a pain in the ass to query and select n examples from each database every single time.
So I have a metadatabase: a vector database containing a summary (pre-summarized by GPT) of every underlying database in every category. I have about 40 databases right now, and I'm going to expand to a lot more after the two other categories are added. My bot obviously can’t take samples from every single one, so it first queries the metadatabase, selects the most relevant databases, and then queries that subset for the actual content.
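A toy sketch of that two-tier routing, with word-overlap scoring standing in for real vector similarity and made-up database names and summaries:

```python
# Two-tier retrieval: the "metadatabase" holds one pre-written summary per
# underlying database; a query is first routed to the most relevant
# databases, then content is fetched only from that subset.

metadatabase = {
    "employment_law": "statutes on hiring, dismissal and employment contracts",
    "safety_regulations": "legal text on workplace safety and hazardous materials",
    "sop_manuals": "standard operating procedures for warehouse equipment",
}

databases = {
    "employment_law": ["Sec 4.1: dismissal requires written notice ..."],
    "safety_regulations": ["Sec 12.3: hazardous materials over 14kg ..."],
    "sop_manuals": ["SOP-7: forklift inspection checklist ..."],
}

def route(query, metadatabase, top_k=2):
    """Rank databases by summary relevance; return the top_k names.
    (Word overlap here; a real system would embed query and summaries.)"""
    q = set(query.lower().split())
    scored = sorted(
        metadatabase,
        key=lambda name: -len(q & set(metadatabase[name].lower().split())),
    )
    return scored[:top_k]

def retrieve(query):
    """Query only the databases the router selected."""
    hits = []
    for name in route(query, metadatabase):
        hits.extend(databases[name])
    return hits

results = retrieve("rules for hazardous materials in the workplace")
```

The point of the routing layer is that adding a 41st database only means adding one more summary; the per-query cost stays bounded by `top_k`.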
I’m getting really good results so far for my use case and I suspect, based on talks I’ve been having with clients and partners, I’ll be doing things on a much larger scale than this. I’m doing one more thing, but I don’t know if it is only useful in my case because of the legal text.
Because of the format of the legal text (sections, subsections, sub-subsections, etc.), there are a ton of small entries. Sometimes content is pulled in by the LLM with missing context. For example, if I pull in a sub-subsection and don’t know what section it refers to, I will have problems, because the whole entry will be something like “and anything over 14kg”.
What I am doing is building a simple (almost linear, to be honest) directed graph and pulling connections N deep: the text n nodes behind and in front of the current entry. I’ll get more sophisticated eventually, but for right now the metadata holds a list of IDs for every “in” and “out” entry connected to the current entry. I simply get the entry, query the VDB by those connection IDs, and pull in the neighboring entries. I can go N deep by querying their connections as well.
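A minimal sketch of that neighbor expansion, with in-memory dicts standing in for the VDB and hypothetical section IDs echoing the "over 14kg" example:

```python
# Each entry's metadata lists the IDs of the entries directly before ("in")
# and after ("out") it in the document; retrieval walks those links up to
# `depth` hops to recover the surrounding context for a tiny entry.

entries = {
    "sec-3":     {"text": "Section 3: Transport of goods.", "in": [], "out": ["sec-3.1"]},
    "sec-3.1":   {"text": "3.1 Weight limits apply,", "in": ["sec-3"], "out": ["sec-3.1.a"]},
    "sec-3.1.a": {"text": "and anything over 14kg.", "in": ["sec-3.1"], "out": []},
}

def expand(entry_id, depth):
    """Collect the entry plus its in/out neighbors up to `depth` hops away."""
    seen = {entry_id}
    frontier = [entry_id]
    for _ in range(depth):
        nxt = []
        for eid in frontier:
            for n in entries[eid]["in"] + entries[eid]["out"]:
                if n not in seen:
                    seen.add(n)
                    nxt.append(n)
        frontier = nxt
    return [entries[e]["text"] for e in sorted(seen)]

# Retrieving the bare "over 14kg" fragment with depth=2 recovers its
# parent subsection and section.
context = expand("sec-3.1.a", depth=2)
```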
I want to expand on that capability soon, maybe to where the AI makes the edges rather than just document order, but for right now it is also really helping. I think the real magic will be in how the AI chooses to use that connected data.
In terms of checking the original data sources, my metadata structure has a source key (file path) for every single entry, and I strip out punctuation and whitespace and hash the result to verify the content. I'm moving away from that toward Levenshtein distances and ZK lookups, though.
Anyways, hope that helps.