How to analyze big CSV files for a chat bot?

Hi, I’m currently interested in developing a chat bot that lets a user ask a question about a dataset. Let’s imagine the dataset (CSV) is sales history for a company with several descriptive columns like store-id, date, item-title, and so on. An example prompt would be:

What were the worst selling products in September? 

Some problems I’ve discovered while planning/researching how to do this:

  • The files are too large to feed to the model directly because of token limits.

  • Embedding the data in a vector database is hard because the models are good at reasoning over text (e.g. articles, blog posts, etc.) and not necessarily any good at answering questions about vectorized CSV files.

What I’m thinking of right now is to code an application that:

  1. Extracts the column names
  2. Captures the user input (e.g. question above)
  3. Combines them into a prompt which is sent to the OpenAI API:

     You are to generate NumPy code based on the following columns from a CSV file:

     <columns>

     and the following prompt:

     <prompt>

  4. Runs the returned NumPy code on the CSV file.
  5. Returns the output to the user.
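
To make the steps above concrete, here is a minimal sketch of how they could fit together. It is not a finished design. `ask_model` is a hypothetical callable standing in for the OpenAI API call (prompt string in, code string out), the `csv_path` variable and the last sentence of the prompt are my own assumptions so the generated code knows where the file is, and the other function names are made up for illustration.

```python
import contextlib
import csv
import io


def extract_columns(csv_path):
    """Step 1: read only the header row, so file size does not matter."""
    with open(csv_path, newline="") as f:
        return next(csv.reader(f))


def build_prompt(columns, question):
    """Step 3: combine the column names and the user's question."""
    return (
        "You are to generate NumPy code based on the following columns "
        "from a CSV file:\n\n"
        + ", ".join(columns)
        + "\n\nand the following prompt:\n\n"
        + question
        # Assumption (not in the original prompt): tell the model where
        # the file is and how to report the answer.
        + "\n\nThe file path is in the variable csv_path. Print the answer."
    )


def run_generated_code(code, csv_path):
    """Step 4: execute the returned code and capture what it prints.

    WARNING: exec() runs arbitrary code. Model output is untrusted, so a
    real application should run it in a sandbox (separate process,
    container, restricted user) rather than in-process like this.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {"csv_path": csv_path})
    return buffer.getvalue().strip()


def answer(csv_path, question, ask_model):
    """Steps 1-5 end to end. ask_model is hypothetical: prompt -> code."""
    prompt = build_prompt(extract_columns(csv_path), question)
    return run_generated_code(ask_model(prompt), csv_path)
```

Only the header row goes to the model, so the token limit stops being a problem regardless of how big the CSV is. The hard part is trusting and validating the generated code before running it.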

What do you guys think? Based on my research so far, it seems the technology just isn’t there yet.

Another idea I just got is to transform the CSV rows into “human-readable” sentences and then vectorize that. Is that possible? Would that make the querying yield better results?
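
Mechanically the transformation is simple. Here is a rough sketch using only the three columns mentioned above (store-id, date, item-title). The sentence template and the sample row are made up for illustration, and whether embedding these sentences actually improves retrieval is something I have not verified.

```python
import csv
import io


def row_to_sentence(row):
    """Turn one CSV row (as a dict) into a plain-English sentence."""
    return (
        f"Store {row['store-id']} sold {row['item-title']} "
        f"on {row['date']}."
    )


def rows_to_sentences(csv_file):
    """Yield one sentence per data row of an open CSV file object."""
    return [row_to_sentence(row) for row in csv.DictReader(csv_file)]


# Made-up sample data, just to show the shape of the output.
sample = io.StringIO("store-id,date,item-title\n12,2023-09-01,Blue Mug\n")
print(rows_to_sentences(sample))
```

Each sentence could then be embedded like any other text chunk. Note that similarity search only returns a handful of rows, so a question like "worst selling products in September" would probably still need some aggregation step over the full data.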


Hi, I have been working on something similar. As you found, vectorised embeddings perform quite poorly on CSV data. I have, however, come across LangChain’s CSV agents, which can use OpenAI to answer questions with fairly decent accuracy. The only caveat is that the agent is still at an experimental stage and can only answer direct questions. It can’t get creative and, say, generate trivia based on the data. Were you able to find any other solution? It seems to me the only way would be to convert the CSV data to text documents, which seems cumbersome.