Hi all,
I am implementing a data system for retrieval and thought I'd get some opinions, given how fast the field is moving.
Some background: I have a bunch of data in the form of documents, tables (think a lot of CSV/Excel files), and other text data.
My question relates mainly to the tabular data; the text data I will embed and store in a vector DB.
The two possible approaches for the tabular data are:
- More traditional:
- Transform into a common structure and load into a traditional relational database (Postgres, etc.).
- Then use the metadata from each table with LlamaIndex's SQLAutoVectorQueryEngine to get the data I need for each question (rough sketch after this list).
Pros:
I can tell exactly what is being queried to produce each result, and I have more control over the databases themselves and their associated metadata and descriptions.
Cons:
A lot harder to scale the structured-data portion of this as more data flows in as CSV/XLSX files.
Will there be confusion about how to combine the text/document data in the vector DB with the relational data in the warehouse?
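For reference, here is a minimal sketch of what approach #1 could look like, based on the documented SQLAutoVectorQueryEngine pattern. The connection string, the city_stats table, the ./docs folder, and the descriptions are all placeholders, and llama_index module paths move between releases, so treat the imports as approximate:

```python
from sqlalchemy import create_engine
from llama_index import SQLDatabase, VectorStoreIndex, SimpleDirectoryReader
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine
from llama_index.indices.vector_store import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo
from llama_index.query_engine import RetrieverQueryEngine, SQLAutoVectorQueryEngine
from llama_index.tools.query_engine import QueryEngineTool

# Structured side: CSV/XLSX files already loaded into Postgres tables.
engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/analytics")
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
sql_query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["city_stats"])

# Unstructured side: the text/document data embedded into a vector index.
documents = SimpleDirectoryReader("./docs").load_data()
vector_index = VectorStoreIndex.from_documents(documents)
vector_store_info = VectorStoreInfo(
    content_info="Reports and other free-text documents",
    metadata_info=[MetadataInfo(name="source", type="str", description="Originating file")],
)
vector_retriever = VectorIndexAutoRetriever(vector_index, vector_store_info=vector_store_info)
vector_query_engine = RetrieverQueryEngine.from_args(vector_retriever)

# The auto engine routes each question to SQL, the vector store, or both.
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description="Answers aggregate/numeric questions over the city_stats table",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Answers questions from the embedded document corpus",
)
query_engine = SQLAutoVectorQueryEngine(sql_tool, vector_tool)
print(query_engine.query("Which city grew fastest, and what do the reports say about why?"))
```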
- Knowledge graph and graph DBs:
Rather than structuring the data for a relational database, use LlamaIndex and unstructured to convert the tabular data into a format that can be loaded into a knowledge graph / graph DB.
I BELIEVE the process for creating such graphs is fairly automated by LlamaIndex and LangChain (see the sketch after this list).
Pros:
Easier to scale.
The relationships might make it easier to pull the relevant data, especially given the scale.
Cons:
I am not sure how well numeric data, the kind generally stored in relational databases, does in a graph DB. Can the tools build relationships from it easily and accurately?
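For what it's worth, the "automated" part in LlamaIndex boils down to the LLM extracting (subject, predicate, object) triplets from each chunk of text, which is exactly where rows of numbers tend to fall apart. A minimal sketch of that pipeline, with a hypothetical folder of tables exported to text and imports that may vary by llama_index version:

```python
from llama_index import KnowledgeGraphIndex, SimpleDirectoryReader, StorageContext
from llama_index.graph_stores import SimpleGraphStore

# Tables exported/flattened to text; the folder name is a placeholder.
documents = SimpleDirectoryReader("./tables_as_text").load_data()

# In-memory graph store for experimentation; swap in a real graph DB store later.
graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

# The LLM extracts up to N (subject, predicate, object) triplets per chunk.
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=10,
    storage_context=storage_context,
    include_embeddings=True,
)

query_engine = kg_index.as_query_engine(include_text=False, response_mode="tree_summarize")
print(query_engine.query("How are the cities in the dataset related?"))
```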
Would love some thoughts and opinions.
Hi there… maybe it's too late and you have already figured this out, but I can only give you some untested opinions.
-
w.r.t. approach #1, your con is spot on, especially the metadata portion. Won't you have to explicitly define metadata for each of your existing data sources (something like the sketch below, for every table), and then add more info as they keep changing or as you add newer data points? Also, how will you address foreign key constraints and other complexities typical of production data? The toy example provided in llama_index is absolutely juvenile, TBH; the spreadsheets most people work with in industry are far more complicated and have obtuse naming conventions that need a mapping even for humans to understand, and you would have to define that mapping for the LLM.
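To make that concrete, this is roughly what the per-table metadata looks like in the documented SQLTableSchema / ObjectIndex pattern. The table names, column notes, and connection string are invented, and the imports may shift between llama_index versions:

```python
from sqlalchemy import create_engine
from llama_index import SQLDatabase, VectorStoreIndex
from llama_index.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema

engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/analytics")
sql_database = SQLDatabase(engine)

# One hand-written context string per table -- and it has to be revised
# every time the source spreadsheets change or new ones arrive.
table_schemas = [
    SQLTableSchema(
        table_name="q3_sales_eu",
        context_str=(
            "Quarterly sales by SKU for the EU region. "
            "'amt_loc' is revenue in local currency; 'fx_dt' is the FX conversion date."
        ),
    ),
    SQLTableSchema(
        table_name="hr_headcnt_v2",
        context_str="Monthly headcount snapshots; 'cc' is cost centre, not country code.",
    ),
]

# The retriever picks which table descriptions to show the LLM for a given question.
table_node_mapping = SQLTableNodeMapping(sql_database)
obj_index = ObjectIndex.from_objects(table_schemas, table_node_mapping, VectorStoreIndex)
table_retriever = obj_index.as_retriever(similarity_top_k=2)
```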
-
w.r.t. option 2, sadly it's almost entirely impossible currently. Knowledge graphs are by definition fantastic for storing relations between different entities, so in the case of a structured DB (even an Excel spreadsheet), how would you ascertain the relationships between different columns? For example, if CITY, YEAR and POP are three columns in your toy DB, no system in the world is currently capable of just looking at the data and telling you the relation between each city's population trend by year (GPT-4 makes some attempts, but you will probably need a Wolfram plugin as well). So assuming that LangChain or LlamaIndex will do this entity resolution for you, and also help you store all the data about a particular entity somewhere, is a little presumptive.
My humble recommendation in this scenario is to begin with a simple RAG system that just maps the query to the table (or set of tables) the info can be found in, then re-prompts the LLM to come up with a query involving those tables, and then matches that against pre-written queries (at least in the first 3-4 weeks of trials). Even after that, it should help with reducing hallucinations. Sadly there's NOTHING automated about ANYTHING even today; you need lots of checks and balances if you want to go beyond a toy example. A rough sketch of that flow is below.
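A minimal, library-free sketch of that route-then-snap flow; `llm` stands in for whatever completion client you use, and the table descriptions and canned queries are made up:

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical table descriptions and pre-written "approved" queries.
TABLE_DESCRIPTIONS = {
    "city_population": "Yearly population counts per city.",
    "city_weather": "Daily temperature and rainfall per city.",
}
CANNED_QUERIES = {
    "population_trend": "SELECT city, year, pop FROM city_population ORDER BY city, year;",
    "hottest_cities": "SELECT city, MAX(temp_c) FROM city_weather GROUP BY city;",
}

def route_tables(question: str, llm: Callable[[str], str]) -> list[str]:
    """Step 1: ask the LLM which tables the question can be answered from."""
    prompt = (
        "Given these tables:\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in TABLE_DESCRIPTIONS.items())
        + f"\nWhich tables are needed to answer: {question!r}? "
        "Reply with a comma-separated list of table names only."
    )
    return [t.strip() for t in llm(prompt).split(",") if t.strip() in TABLE_DESCRIPTIONS]

def draft_sql(question: str, tables: list[str], llm: Callable[[str], str]) -> str:
    """Step 2: re-prompt for a SQL draft restricted to the routed tables."""
    prompt = f"Write a single SQL query over tables {tables} that answers: {question}"
    return llm(prompt)

def snap_to_canned(draft: str) -> str:
    """Step 3: replace the draft with the closest pre-written query (hallucination guard)."""
    return max(CANNED_QUERIES.values(), key=lambda q: SequenceMatcher(None, draft, q).ratio())
```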