Private Chat with CSV data

Hi All,

I have a CSV with 50,000 employee records and I want to query the records. One approach I tried was creating embeddings, storing the data in a vector DB, and using the RetrievalQA chain. However, the results are always wrong.

I then tried the create_csv_agent and it gives me the correct result. However, this consumes more tokens.

My question is: what is the right approach to query the CSV data?



I have tried your first approach as well, and yes, it is extremely sensitive to the query, so you have to tune the API parameters, like temperature, very carefully.

The second is also the best for me, although it is, as you said, inevitably more token-consuming.


For the first approach, while dealing with a CSV, let’s take the following code as an example:

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=1)
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': k})
chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)
answer = chain.run(query)

Here the “k” value is important. Now let’s say that the CSV contains thousands of records and the query is expected to return a variable number of rows. In that case, how do I adjust the “k” value?

What specific fields does your data contain, and what kind of questions do you expect to ask and what are you expecting to receive as answers?

Let’s say this CSV contains daily attendance data of people. The columns are Name, location, intime, out time, etc. So, for example, I want to query how many days, and on which dates, a particular person was present.

The answer will have a variable number of rows. How do I achieve this?

After tolerating the terrible output of gpt-4-turbo, I went back to real gpt-4 for an example that is still rooted in CSV as a source, to give an idea of the functions that an AI with function-calling ability could call:

Given your requirements, it seems like you need a more flexible way to query your data. Python’s pandas library is perfect for this. It allows you to read your CSV file into a DataFrame, which you can then query in a variety of ways.

Here’s how you can implement the functions you described:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('attendance.csv')

# Convert 'intime' and 'outtime' to datetime, assuming they are in 'HH:MM:SS' format
df['intime'] = pd.to_datetime(df['intime'], format='%H:%M:%S').dt.time
df['outtime'] = pd.to_datetime(df['outtime'], format='%H:%M:%S').dt.time

# Convert 'date' to datetime, assuming it's in 'YYYY-MM-DD' format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

def list_names(search=None):
    if search is None or search == 'all':
        return df['Name'].unique().tolist()
    return df[df['Name'].str.startswith(search)]['Name'].unique().tolist()

def search_time_records(query):
    name = query.get('employee name')
    date_range = query.get('date range')
    time_range = query.get('time range')

    records = df[df['Name'] == name]

    if date_range:
        start_date, end_date = pd.to_datetime(date_range)
        records = records[(records['date'] >= start_date) & (records['date'] <= end_date)]

    if time_range:
        start_time, end_time = [pd.to_datetime(t, format='%H:%M:%S').time() for t in time_range]
        records = records[(records['intime'] >= start_time) & (records['outtime'] <= end_time)]

    return records

def employee_attendance_count(query):
    records = search_time_records(query)
    return len(records)

In this code:

  • The list_names function returns a list of unique names in the DataFrame. If a search string is provided, it returns the names that start with that string.
  • The search_time_records function filters the DataFrame based on the provided query, which can include an employee name, a date range, and a time range. It returns the filtered records.
  • The employee_attendance_count function uses search_time_records to get the relevant records and then returns their count.

You can call these functions with a dictionary representing your query. For example:

query = {
    'employee name': 'John Doe',
    'date range': ['2024-01-01', '2024-01-31'],
    'time range': ['09:00:00', '17:00:00']
}

records = search_time_records(query)
days_present = employee_attendance_count(query)

This code assumes that your date and time are in ‘YYYY-MM-DD’ and ‘HH:MM:SS’ formats respectively. If they are in a different format, you should adjust the pd.to_datetime calls accordingly.


I’m sure you’re aware that an LLM is not a match for basic query language (SQL, et al) and for that reason, I suspect you are involving the LLM because you have a need for semantic language/NLP interface.

In other words, you’d like a system that allows a user to use natural language to retrieve data and this is the reason you’re involving an LLM.

If this is true, you’re in luck, because there is a simple solution that requires very few tokens, and is far quicker/more accurate.

To approach this, understand what each part of the equation does best. The LLM is used to understand the user’s request, while a structured query language (not necessarily “SQL”) is used to recover the data.

For simplicity in this example, I’ll use SQL and provide you with a step-by-step guide to recreate the functionality you desire. My approach is different from @_j’s above – his eliminates the LLM from the query step, whereas I’m combining the two:

  1. Import your CSV into a simple/quick datastore that can be queried based on dynamically created query language. The key being that it’s intended to interpret dynamic commands, which is why SQL is a good match. For this demo, I’d go with something ultra-basic/quick/easy like MySQL
  2. Create an LLM prompt that explains the CSV that’s stored in MySQL (the table, the purpose of the columns, etc)
  3. Write your own UI (or re-use one of the open-source UIs available) to create a customer-facing chat-bot (much easier than it sounds; it can be done in less than 50 lines of code, and GPT can write it for you).
  4. Behind the scenes, every prompt will include your explanation of the data, asking the LLM to create SQL that produces the records requested by the user.
  5. The LLM will return a SQL statement, which you’ll strip out and execute. Return the LLM’s text answer (minus the stripped SQL), along with the recordset returned by the query.
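The pipeline above can be sketched in a few lines. This is only an illustration of steps 4–5: the prompt wording, the `extract_sql`/`answer_user` helpers, the canned LLM reply, and the in-memory SQLite stand-in for MySQL are all my assumptions, not part of the original outline.

```python
import re
import sqlite3

# Step 2/4: schema explanation included in every prompt sent to the LLM.
SCHEMA_PROMPT = (
    "Table attendance(name TEXT, date TEXT, intime TEXT, outtime TEXT). "
    "Answer the user's question by writing one SQL statement inside "
    "```sql ... ``` fences."
)

def extract_sql(llm_reply):
    """Step 5: strip the SQL statement out of the LLM's reply."""
    match = re.search(r"```sql\s*(.*?)```", llm_reply, re.DOTALL)
    return match.group(1).strip() if match else None

def answer_user(llm_reply, conn):
    """Return the LLM's text (minus the SQL) plus the recordset."""
    sql = extract_sql(llm_reply)
    text = re.sub(r"```sql.*?```", "", llm_reply, flags=re.DOTALL).strip()
    rows = conn.execute(sql).fetchall() if sql else []
    return text, rows

# Demo with an in-memory SQLite DB and a canned LLM reply
# (in production the reply would come from the chat API).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attendance(name TEXT, date TEXT, intime TEXT, outtime TEXT)")
conn.execute("INSERT INTO attendance VALUES ('John Doe', '2024-01-02', '09:00', '17:30')")

canned_reply = (
    "Here are John Doe's attendance records:\n"
    "```sql\nSELECT date FROM attendance WHERE name = 'John Doe'\n```"
)
text, rows = answer_user(canned_reply, conn)
```

The user sees `text` plus the formatted `rows`; the SQL itself never reaches the chat window.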

In short, you have a perfect blend of LLM, procedural code (your script), and set theory (SQL, etc.). Each form of technology/script doing what it’s best at!

And in turn … You have a highly reliable piece of code that runs much less expensively, and far more accurately, than any other approach. Win-Win :slight_smile:


I provide the example developer code backend for AI function calling, so the AI can write a function request for the information, then answer the user about it.

The API parameter “functions” can be matched to one of the three functions shown, to access particular aspects of the data as was pondered in the original post (while not giving the AI carte blanche to write SQL queries).
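As a concrete illustration of that “functions” parameter, a JSON-schema spec matching the three pandas functions above might look like this. The property names and descriptions are my assumptions, chosen to line up with the earlier code:

```python
# Hypothetical "functions" list for the Chat Completions API, matching
# list_names / search_time_records / employee_attendance_count above.
functions = [
    {
        "name": "list_names",
        "description": "List unique employee names, optionally filtered by prefix.",
        "parameters": {
            "type": "object",
            "properties": {
                "search": {"type": "string",
                           "description": "Name prefix, or 'all'."},
            },
        },
    },
    {
        "name": "search_time_records",
        "description": "Return attendance rows for one employee.",
        "parameters": {
            "type": "object",
            "properties": {
                "employee name": {"type": "string"},
                "date range": {"type": "array", "items": {"type": "string"},
                               "description": "[start, end] as YYYY-MM-DD"},
                "time range": {"type": "array", "items": {"type": "string"},
                               "description": "[start, end] as HH:MM:SS"},
            },
            "required": ["employee name"],
        },
    },
    {
        "name": "employee_attendance_count",
        "description": "Count attendance rows matching the same query shape.",
        "parameters": {
            "type": "object",
            "properties": {"employee name": {"type": "string"}},
            "required": ["employee name"],
        },
    },
]
```

The model then emits a function name plus arguments, which your backend dispatches to the matching pandas function.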

There are trade-offs to an ORB approach as you’ve suggested. I agree that if there were some security issue (in this case there is not; it’s an isolated container for a CSV) – abstracting to an ORB would be ideal.

In this instance, abstracting to an ORB ties the hands of the LLM, whereas allowing the full power of a structured query language buys you more powerful/flexible querying and capability, assuming that the developer desires this.

Also – there may have been some confusion. Your solution is great, I “thanked” it, and I appreciate all of your posts. I was not criticizing, not in any way. I was explaining the difference in approach.

I posted here examples of the actual system and user message prompts I use in one application to achieve the text to SQL process similar to what @DevGirl outlined above: How to apply user based access control check while generating SQL queries using OpenAI API using langchain? - #4 by SomebodySysop

It’s really a cool way to use the LLM and your existing data without having to embed it.

A few quick examples I cut and pasted from my log:

please list the last 20 log type = sql entries. Just display the descriptions.
what was the last model used for log type ‘sql’?
what was the last model used for sql queries?
show me 10 queries from users other than Reggie or somebody. Only show the user, question, notes and date.
when was the first query I posted?
what have my queries been about?
how many queries have I made?
what has been the average cost of my queries?
show me the total time elapsed for all of my queries?
show me the dates and descriptions and cost and elapsed time and models used for the last 5 questions I asked.


@SomebodySysop : Awesome post and recommendations. Thank you!

I have built myself an app where my customers can interact with their data (datasets from CSV files attached to the assistants), and I am running into serious latency and performance issues (Assistant API, Code Interpreter, gpt-4-1106-preview). So I want to investigate an alternative approach that would generate some SQL code based on the user input and with the SQL being executed against a MySQL database that already contains the customer’s data.

Where I am struggling is how to format the data returned by the SQL query so it feels like the AI is responding to the user input in plain English. I’m thinking about using a separate assistant for that specific purpose, with two user inputs:
1- Initial request from the user
2- Data returned by the SQL query that was executed against the database
and with the instructions being to format this data in plain English so it feels like a natural response to the initial request made by the user.

Any thoughts on that?

I would follow the outline suggested by @DevGirl here: Private Chat with CSV data - #7 by DevGirl

This is the basic overview of the prompt that I followed:

You really want to define your table(s) as best you can to the LLM so it can understand how to translate the user’s request into the best SQL statement possible.

Right now, I return an array that I format as an HTML table for display. It sounds like you would want to take the response and send it back to the LLM to generate a plain-English response. You can have it summarize the entire output, or iterate through it, returning natural-language explanations of each row.
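A minimal sketch of that second pass: build a prompt from the original question plus the recordset, and send it to any chat completion endpoint. The prompt wording and helper name here are my own, for illustration only:

```python
def build_summary_messages(user_question, columns, rows):
    """Build a chat prompt that asks the model to phrase the SQL
    results as a plain-English answer to the original question."""
    table = "\n".join(
        ", ".join(f"{c}={v}" for c, v in zip(columns, row)) for row in rows
    )
    return [
        {"role": "system",
         "content": "Answer the user's question in plain English using "
                    "only the query results provided."},
        {"role": "user",
         "content": f"Question: {user_question}\n\nQuery results:\n{table}"},
    ]

messages = build_summary_messages(
    "How many days was John Doe present in January?",
    ["date"],
    [("2024-01-02",), ("2024-01-03",)],
)
# These messages would then be sent to the chat completions endpoint.
```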

You should be able to do this easily with the Assistants API. Not sure why you’d want to use the Code Interpreter specifically for this, but even gpt-3.5-turbo-0125 is able to handle something like this.


Thank you so much for your feedback.

I am using Code Interpreter with my current implementation, but if I switch to a “SQL based” approach, then it won’t be required. Better user experience with hopefully less latency, as well as lower operating costs. Definitely worth exploring!


stick that data into mysql and run queries on it lol. why fiddle around “querying csv data”? that doesn’t even make sense, and it’s not what csv is designed for at all…

I decided to give this a try myself. I added an option to my Text to SQL query screen to optionally “analyze” the results:

So now, I tell the LLM to generate a report from my database using SQL. Then, I send the resulting report back to the LLM to analyze and render a response based upon that analysis. It works really, really well – at least with gpt-4-turbo-preview.

This is a really cool answer. We also have a similar requirement, but for us semantic search is also needed :frowning:. Any suggestion on how to achieve both?

FWIW, this link is to an OpenAI cookbook showing how to generate the SQL query using the chat completion API in function-argument-generation mode.

I’ve used it, with the code ported to C# .NET 7 in VS2022, to generate queries into an SQLite database containing tables downloaded from the IMDB public dataset. It does a pretty decent job. My program then queries the DB and displays the result. This implements the idea posted by DevGirl above.
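The cookbook’s trick is to define a single tool whose argument is the SQL string, so the model “calls” it with a fully formed query that your code then executes. A sketch of that pattern, shown here in Python against an in-memory SQLite DB; the `ask_database` name follows the cookbook, but the details are my assumption:

```python
import json
import sqlite3

# Single-function pattern: the model returns its SQL as a
# function-call argument instead of free text.
tools = [{
    "type": "function",
    "function": {
        "name": "ask_database",
        "description": "Run a read-only SQL query against the movie DB.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "A syntactically valid SQLite SELECT statement.",
                },
            },
            "required": ["query"],
        },
    },
}]

def handle_tool_call(tool_call_arguments, conn):
    """Execute the SQL the model produced and return rows as JSON."""
    sql = json.loads(tool_call_arguments)["query"]
    return json.dumps(conn.execute(sql).fetchall())

# Demo with a canned tool-call argument string, as the API would return it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles(name TEXT)")
conn.execute("INSERT INTO titles VALUES ('The Matrix')")
result = handle_tool_call('{"query": "SELECT name FROM titles"}', conn)
```

The JSON `result` is then fed back to the model as the tool message, and the model phrases the final answer for the user.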

Here is an example of a solution where you can upload your CSV (just replace your employees’ private data with IDs and keep the text description data): www . table-search . com/share-my-list.html

and start asking questions with requests in a human-readable format.

Under the hood, the service uses OpenAI Embeddings and a vector DB – similar to what you described.