Understanding Vector Database

How does tabular numerical data work in vector databases? In other words, can you just take a CSV with numbers, like salary data, save it in a vector database, AND still be able to perform operations on it as if it were tabular, e.g. selection and filtering? Or is some context lost? Can it easily get the MAX of a column?

  1. You're still limited by the same context window. The only thing embeddings and vector DBs help with is retrieving the chunks of most relevant information. That information is then added to the standard GPT prompt. These are really two separate things: embeddings and vector DBs don't let you pass more information to GPT, they just let you pass the most relevant information.
  2. As you might be guessing by now, this doesn't work well for raw numerical data. But fundamentally, neither do LLMs: they are bad at math, and I wouldn't trust them to compute a MAX of anything. Regular relational databases, it turns out, are really good at that sort of thing. Probably the best route to explore is sharing a relational database schema with GPT and asking it to write SQL queries, but even that can be problematic.
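
To make point 2 concrete, here is a minimal sketch of the relational route using Python's built-in sqlite3 module. The table name, column names, and sample rows are made up for illustration; the point is only that MAX and filtering are exact one-line queries once the CSV sits in a relational table.

```python
import csv
import io
import sqlite3

# Hypothetical CSV contents standing in for a real salary file.
csv_text = """name,department,salary
Ada,eng,120000
Grace,eng,135000
Linus,ops,98000
"""

# Load the CSV rows into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (name TEXT, department TEXT, salary INTEGER)")
rows = [
    (r["name"], r["department"], int(r["salary"]))
    for r in csv.DictReader(io.StringIO(csv_text))
]
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", rows)

# Aggregation: computed exactly by the database, not estimated by a model.
max_salary = conn.execute("SELECT MAX(salary) FROM salaries").fetchone()[0]

# Selection and filtering: a plain WHERE clause with a bound parameter.
eng_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM salaries WHERE department = ? ORDER BY name", ("eng",)
    )
]

print(max_salary, eng_names)
```

In a setup like this, the model's only job would be to write the SQL string; the database does the arithmetic.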

That’s interesting. I am currently using the approach from point 2, and it works incredibly well with pandas. I use few-shot examples and a detailed explanation of the DB. I was curious whether vector DBs offered much for numerical data at this point. Thanks for pointing it out.

Hey @jethro.adeniran
Can you say how you described your DB?
When I tried this, ChatGPT could not work out which table to pick. When I passed all the table names, it had trouble with the column names, so I then shared the column names for the tables it was picking. Because some of my columns hold JSON-formatted data, I even sent the structure of that JSON in the prompt.
Even after this, it could not write a proper query.
Next I tried sending it some sample queries as well, but this obviously kept increasing my prompt length.

Now I am trying to use embeddings to do this.
I'd like to hear your opinion on that.

I have posted a question about this in the community, please have a look.

Do you have many databases or one?
If you have one, include a description of the database as part of your prompt: list a subset of the column names, explain what they mean, and then give it a few actual examples (few-shot) of the queries you need.

Your first prompt must carry a lot of info (though not too much); then you will get reliable SQL or Python that matches your specific database.
Also try storing each prompt as a string and printing it before sending, to check where things are going wrong.
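
As a rough sketch of that advice, the prompt can be assembled as a plain string from a table description and a few question/code pairs, then printed before it goes anywhere. The function name, the prompt wording, and the example table are all assumptions for illustration, not a fixed format.

```python
def build_prompt(schema_description: str, examples: list[tuple[str, str]], question: str) -> str:
    """Assemble a few-shot prompt as one string so it can be inspected before sending."""
    parts = [
        "You write pandas code for the table described below.",
        "",
        "Table description:",
        schema_description.strip(),
        "",
    ]
    # Each few-shot example pairs a question with the code expected for it.
    for example_question, example_code in examples:
        parts += [f"Question: {example_question}", f"Code: {example_code}", ""]
    # The real question goes last, leaving the code for the model to fill in.
    parts += [f"Question: {question}", "Code:"]
    return "\n".join(parts)


# Hypothetical description of a salary table held in a DataFrame named df.
schema = """
df columns:
- name: employee name (string)
- department: short department code, e.g. 'eng' or 'ops' (string)
- salary: yearly salary in USD (integer)
"""

few_shot = [
    ("What is the highest salary?", "result = df['salary'].max()"),
    ("Who works in ops?", "result = df[df['department'] == 'ops'][['name']]"),
]

prompt = build_prompt(schema, few_shot, "What is the average salary in eng?")
print(prompt)  # check the exact text before sending it to the model
```

Keeping the prompt as a single string also makes it easy to log it next to the model's answer when a query comes back wrong.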

Temperature is also important: lower is better when it comes to generating code, say 0.3.
All in all, what mine does is simple: it knows the database well and it has good examples of the code I expect. I intercept the generated code with a Python REPL, run the pandas, and send the result as HTML, which I output on the frontend.
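
A minimal sketch of that last step, assuming the model returns a pandas snippet that assigns its answer to a variable called `result`. The DataFrame, the `generated_code` string, and the `run_generated_pandas` helper are hypothetical stand-ins; plain `exec` is used here in place of a real REPL tool, and it is not a sandbox.

```python
import pandas as pd

# Hypothetical salary table standing in for the real data.
df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],
        "department": ["eng", "eng", "ops"],
        "salary": [120000, 135000, 98000],
    }
)

# Stand-in for the code string the model would return.
generated_code = "result = df[df['department'] == 'eng'][['name', 'salary']]"


def run_generated_pandas(code: str, frame: pd.DataFrame) -> str:
    """Execute model-written pandas code and return the result as an HTML table."""
    # The code sees only a copy of the frame and pandas itself.
    namespace = {"df": frame.copy(), "pd": pd}
    exec(code, namespace)  # not a sandbox: isolate this if the code is untrusted
    result = namespace.get("result")
    if isinstance(result, pd.Series):
        result = result.to_frame()
    if not isinstance(result, pd.DataFrame):
        # Scalars such as a MAX value are wrapped so they render as a table too.
        result = pd.DataFrame({"result": [result]})
    return result.to_html(index=False)


html = run_generated_pandas(generated_code, df)
print(html)
```

The numbers in the HTML come from pandas, so the model never does the arithmetic itself; it only writes the code.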
