Hi, I’m currently interested in developing a chatbot that lets a user ask questions about a dataset. Let’s imagine the dataset (a CSV file) is the sales history for a company, with several descriptive columns like store-id, date, item-title, and so on. An example prompt would be:
What were the worst selling products in September?
Some problems I’ve discovered while planning/researching how to do this:
- The files are too large to feed directly to the model because of token limits.
- Embeddings with a vector database are hard to use here because the models are good at reasoning over text (e.g. articles, blog posts, etc.) and not necessarily any good at answering questions about vectorized CSV files.
What I’m thinking of right now is to code an application that:
- Extracts the column names
- Captures the user input (e.g. question above)
- Combines them into a prompt which is sent to the OpenAI API:
  You are to generate NumPy code based on the following columns from a CSV file:
  <columns>
  and the following prompt:
  <prompt>
- Runs the returned NumPy code on the CSV file.
- Returns the output to the user.
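To make the steps above concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: the `units-sold` column, the sample rows, and the helper names are made up, and `fake_model_response` is a hand-written stand-in for whatever the OpenAI API would return (no API call is made). The stand-in uses only the standard library instead of NumPy so the sketch runs without extra dependencies.

```python
import csv
import io

# Prompt template from the post; {columns} and {question} are filled in at runtime.
PROMPT_TEMPLATE = (
    "You are to generate NumPy code based on the following columns from a CSV file:\n"
    "{columns}\n"
    "and the following prompt:\n"
    "{question}"
)


def extract_columns(csv_text):
    """Return the header row of a CSV document given as a string."""
    return next(csv.reader(io.StringIO(csv_text)))


def build_prompt(columns, question):
    """Combine the column names and the user's question into one prompt."""
    return PROMPT_TEMPLATE.format(columns=", ".join(columns), question=question)


def run_generated_code(code, csv_text):
    """Run generated code with `csv_text` in scope and return its `result` variable.

    Assumption: the generated code stores its answer in a variable named `result`.
    exec() on untrusted code is unsafe; a real application would need sandboxing.
    """
    namespace = {"csv_text": csv_text}
    exec(code, namespace)
    return namespace.get("result")


# Hypothetical sample data; `units-sold` is an assumed column.
sample_csv = (
    "store-id,date,item-title,units-sold\n"
    "1,2023-09-01,Widget,3\n"
    "1,2023-09-02,Gadget,1\n"
    "2,2023-09-03,Widget,2\n"
)

columns = extract_columns(sample_csv)
prompt = build_prompt(columns, "What were the worst selling products in September?")

# Hand-written stand-in for the code the model might return.
fake_model_response = """
import csv, io
totals = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    title = row["item-title"]
    totals[title] = totals.get(title, 0) + int(row["units-sold"])
result = min(totals, key=totals.get)
"""

answer = run_generated_code(fake_model_response, sample_csv)
print(answer)
```

Only the column names and the question go into the prompt, so the size of the CSV file itself would not count against the token limit. The obvious weak point is running code the model wrote, which is why the `exec` call is flagged.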
What do you guys think? At this point, based on my research, it seems the technology just isn’t there yet.
Another idea I just had is to transform the CSV rows into “human-readable” sentences and then vectorize those. Is that possible? Would it make the queries yield better results?
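For the row-to-sentence idea, a minimal sketch of the transformation step could look like the following. The sentence wording, the `units-sold` column, and the sample rows are assumptions for illustration; the embedding step is left out because it depends on which model or vector database is used.

```python
import csv
import io


def row_to_sentence(row):
    """Turn one CSV row (as a dict) into a plain-English sentence."""
    return (
        f"On {row['date']}, store {row['store-id']} sold "
        f"{row['units-sold']} units of {row['item-title']}."
    )


# Hypothetical sample data; `units-sold` is an assumed column.
sample_csv = (
    "store-id,date,item-title,units-sold\n"
    "1,2023-09-01,Widget,3\n"
    "1,2023-09-02,Gadget,1\n"
)

sentences = [row_to_sentence(row) for row in csv.DictReader(io.StringIO(sample_csv))]
for sentence in sentences:
    print(sentence)
```

Each sentence could then be embedded like any other text. What I'm unsure about is whether retrieving a handful of such sentences would help with aggregate questions like "worst selling", since those seem to need a computation over all rows.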