How does tabular numerical data work in vector databases? In other words, can you take a CSV of numbers, such as salary data, save it in a vector database, and still perform tabular operations on it, e.g. selection and filtering? Or is some context lost? Can it easily get the MAX of a column?
- You’re still limited by the same context window. The only thing embeddings and vector DBs help with is retrieving the chunks of most relevant information. That information is then added to the standard GPT prompt. These are really two separate things: embeddings and vector DBs don’t let you pass more information to GPT, they just let you pass the most relevant information.
- As you might be guessing by now, this doesn’t work well for raw numerical data. But fundamentally, neither do LLMs. They are bad at math, and I wouldn’t trust them to compute the MAX of anything. Regular relational databases, it turns out, are really good at that sort of thing. Probably the best route to explore is sharing a relational database schema with GPT and asking it to write SQL queries, but even that can be problematic.
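To make the relational route concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `salaries` table, its columns, and the rows are made up for illustration. It shows the database computing MAX and doing the filtering itself, and pulls out the schema text you could share with GPT when asking for SQL.

```python
import sqlite3

# Hypothetical salary table; the names and figures are made up for illustration.
rows = [
    ("Alice", "Engineering", 120000),
    ("Bob", "Engineering", 95000),
    ("Carla", "Sales", 70000),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (name TEXT, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", rows)

# The database computes the aggregate exactly; no model arithmetic is involved.
max_salary = conn.execute("SELECT MAX(salary) FROM salaries").fetchone()[0]

# Selection and filtering are ordinary WHERE clauses.
engineers = conn.execute(
    "SELECT name FROM salaries WHERE department = ? ORDER BY name",
    ("Engineering",),
).fetchall()

# The stored CREATE TABLE text is what you could paste into a prompt as the schema.
schema = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'salaries'"
).fetchone()[0]

print(max_salary)
print(engineers)
print(schema)
```

The model only ever needs to see `schema` and the question; the numbers stay in the database, which is where the arithmetic happens.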
That’s interesting. I am currently using step 2 and it works incredibly well with pandas. I use few-shot examples and a detailed explanation of the DB. I was curious whether vector DBs offer much for numerical data at this point. Thanks for pointing it out.
Can you say how you described your DB?
When I tried this, ChatGPT could not work out which table to pick when I passed it all the table names. It also had trouble with column names, so I shared the column names for the tables it was picking. Because the data in some of my columns is formatted as JSON, I even sent the structure of that JSON in the prompt.
Even after this, it could not write a proper query.
Next I tried sending it some sample queries as well, but this obviously kept increasing my prompt length.
Now I am trying to use embeddings to do these things.
I’d like to hear your opinion on that.
I have posted a question about this in the community; please have a look.
Do you have many databases or one?
If you have one, include in your prompt a description of the database, with a percentage of the column names and an explanation of what they mean, and then actually give it few-shot examples of the queries you need.
Your first prompt must carry a lot of info, though maybe not too much; then you will get reliable SQL or Python that matches your specific database.
Also try storing your prompts as strings and printing them before sending, to check where things are going wrong.
Temperature is also important: lower is better when it comes to generating code, say 0.3.
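As a rough sketch of that advice, the snippet below assembles a database description and few-shot examples into a single prompt string and prints it before anything is sent. The description, the examples, and the names `build_prompt` and `request_settings` are all hypothetical, and no API call is made; how you send the prompt depends on your client library.

```python
# Hypothetical schema description and few-shot examples; replace with your own.
DB_DESCRIPTION = """\
DataFrame `df` holds one row per employee.
Columns:
- name (str): employee full name
- department (str): team the employee belongs to
- salary (int): yearly salary in USD
"""

FEW_SHOT_EXAMPLES = [
    ("What is the highest salary?", "df['salary'].max()"),
    ("List employees in Sales.", "df[df['department'] == 'Sales']"),
]


def build_prompt(question: str) -> str:
    """Assemble the description, the examples, and the question into one string."""
    parts = [
        "You write pandas expressions for the table described below.",
        "Reply with a single expression and nothing else.",
        "",
        DB_DESCRIPTION,
    ]
    for example_question, example_code in FEW_SHOT_EXAMPLES:
        parts.append(f"Q: {example_question}\nA: {example_code}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)


prompt = build_prompt("What is the average salary per department?")

# Print the exact string before sending it, so mistakes are easy to spot.
print(prompt)

# Keep generation settings next to the prompt; 0.3 is the low temperature suggested above.
request_settings = {"temperature": 0.3, "prompt": prompt}
```

Keeping the prompt as one plain string means you can log it, diff it between runs, and see exactly what the model was given when a query comes back wrong.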
All in all, what mine does is simple: it knows the database well and it has good examples of the code I expect. I intercept the output with a Python REPL, run the pandas code, and send the result as HTML, which I output on the frontend.
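A minimal sketch of that last step, assuming the model returns a single pandas expression as a string. The data, the `run_generated` helper, and the `generated` string are hypothetical stand-ins, and plain `eval` is used here in place of a real REPL; it is not a sandbox, so only run code you are willing to trust.

```python
import pandas as pd

# Hypothetical data standing in for the real table.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carla"],
        "department": ["Engineering", "Engineering", "Sales"],
        "salary": [120000, 95000, 70000],
    }
)


def run_generated(expression: str, frame: pd.DataFrame) -> str:
    """Evaluate a model-written pandas expression and return the result as HTML."""
    # Only `df` and `pd` are exposed and builtins are emptied.
    # This limits accidents but is NOT a security boundary.
    namespace = {"df": frame, "pd": pd, "__builtins__": {}}
    result = eval(expression, namespace)
    if isinstance(result, pd.Series):
        result = result.to_frame()
    if isinstance(result, pd.DataFrame):
        return result.to_html()
    # Scalars such as a max value are wrapped in a simple paragraph.
    return f"<p>{result}</p>"


# Pretend this string came back from the model.
generated = "df[df['department'] == 'Engineering'][['name', 'salary']]"
html = run_generated(generated, df)
print(html)
```

The aggregate or filter is computed by pandas, not by the model, so a question like MAX of a column comes back exact as long as the generated expression is right.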