I have a CSV with 50,000 employee records and I want to query the records. One approach I tried was to create embeddings, store the data in a vector DB, and use the RetrievalQA chain. However, the results are always wrong.
I then tried create_csv_agent and it gives me the correct result. However, this consumes more tokens.
My question is: what is the right approach to query the CSV data?
I have tried your first approach too and yes, it is extremely sensitive to the query, so you have to adjust the API parameters, like temperature, very carefully.
The second is also the best for me, although it is, as you said, inevitably more token-consuming.
Here the 'k' value is important. Now let's say the CSV contains thousands of records and the query is expected to return a variable number of rows. In that case, how should the 'k' value be adjusted?
Let's say this CSV contains daily attendance data for people. The columns are Name, location, intime, outtime, etc. I want to query, for a particular person, how many days they were present, and print the number of days and the dates.
The answer will have a variable number of rows. How can this be achieved?
After tolerating the poor output of gpt-4-turbo, back to the real gpt-4 for an example that is still rooted in CSV as a source, to give an idea of functions that an AI with function-calling ability could send requests to:
Given your requirements, it seems like you need a more flexible way to query your data. Python's pandas library is perfect for this. It allows you to read your CSV file into a DataFrame, which you can then query in a variety of ways.
Here's how you can implement the functions you described:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('attendance.csv')

# Convert 'intime' and 'outtime' to time, assuming they are in 'HH:MM:SS' format
df['intime'] = pd.to_datetime(df['intime'], format='%H:%M:%S').dt.time
df['outtime'] = pd.to_datetime(df['outtime'], format='%H:%M:%S').dt.time

# Convert 'date' to datetime, assuming it's in 'YYYY-MM-DD' format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

def list_names(search=None):
    # Return every unique name, or only those starting with `search`
    if search is None or search == 'all':
        return df['Name'].unique().tolist()
    else:
        return df[df['Name'].str.startswith(search)]['Name'].unique().tolist()

def search_time_records(query):
    # Filter by employee name, then optionally narrow by date and time ranges
    name = query.get('employee name')
    date_range = query.get('date range')
    time_range = query.get('time range')
    records = df[df['Name'] == name]
    if date_range:
        start_date, end_date = pd.to_datetime(date_range)
        records = records[(records['date'] >= start_date) & (records['date'] <= end_date)]
    if time_range:
        start_time, end_time = [pd.to_datetime(t, format='%H:%M:%S').time() for t in time_range]
        records = records[(records['intime'] >= start_time) & (records['outtime'] <= end_time)]
    return records

def employee_attendance_count(query):
    # Count the rows matching the same query shape
    records = search_time_records(query)
    return len(records)
In this code:
The list_names function returns a list of unique names in the DataFrame. If a search string is provided, it returns only names that start with that string.
The search_time_records function filters the DataFrame based on the provided query, which can include an employee name, a date range, and a time range. It returns the filtered records.
The employee_attendance_count function uses search_time_records to get the relevant records and returns the count.
You can call these functions with a dictionary representing your query. For example:
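A call might look like the following. This is a self-contained sketch: the few sample rows and the simplified counting function are hypothetical stand-ins for the full attendance.csv and the functions defined above.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for attendance.csv
csv_data = """Name,date,intime,outtime
Alice,2024-01-02,09:00:00,17:00:00
Alice,2024-01-03,09:15:00,17:30:00
Bob,2024-01-02,08:45:00,16:50:00
"""
df = pd.read_csv(io.StringIO(csv_data))
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

def employee_attendance_count(query):
    # Simplified version of the function above: name filter plus optional date range
    records = df[df['Name'] == query.get('employee name')]
    date_range = query.get('date range')
    if date_range:
        start, end = pd.to_datetime(date_range)
        records = records[(records['date'] >= start) & (records['date'] <= end)]
    return len(records)

print(employee_attendance_count({'employee name': 'Alice',
                                 'date range': ['2024-01-01', '2024-01-31']}))  # → 2
```

The same dictionary shape works for search_time_records, which returns the matching rows (and therefore the dates) rather than just the count.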
This code assumes that your dates and times are in 'YYYY-MM-DD' and 'HH:MM:SS' formats respectively. If they are in a different format, you should adjust the pd.to_datetime calls accordingly.
I'm sure you're aware that an LLM is no match for a basic query language (SQL, et al.), and for that reason I suspect you are involving the LLM because you need a semantic-language/NLP interface.
In other words, you'd like a system that allows a user to use natural language to retrieve data, and this is the reason you're involving an LLM.
If this is true, you're in luck, because there is a simple solution that requires very few tokens, and is far quicker and more accurate.
To approach this, understand what each part of the equation does best. The LLM is used to understand the user's request, while a structured query language (not necessarily SQL) is used to recover the data.
For simplicity of this example, I'll use SQL and provide a step-by-step guide to recreate the functionality you desire. My approach is different from @_j's above: his eliminates the LLM, whereas I'm combining the two:
1. Import your CSV into a simple/quick datastore that can be queried with dynamically created query language. The key is that it's intended to interpret dynamic commands, which is why SQL is a good match. For this demo, I'd go with something ultra-basic/quick/easy like MySQL.
2. Create an LLM prompt that explains the CSV that's stored in MySQL (the table, the purpose of the columns, etc.).
3. Write your own UI (or reuse one of the open-source UIs available) to create a customer-facing chatbot (much easier than it sounds; it can be done in less than 50 lines of code and written by GPT).
4. Behind the scenes, every prompt will include your explanation of the data, asking the LLM to create SQL that produces the records requested by the user.
5. The LLM will return a SQL statement, which you'll strip out. Return the LLM's text answer (minus the stripped-out SQL), along with the recordset returned by SQL.
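The strip-and-execute step above can be sketched as follows. This is a minimal, self-contained illustration: the LLM reply is simulated (a real one would come from the chat API), and an in-memory SQLite database with a hypothetical attendance table stands in for MySQL.

```python
import re
import sqlite3

FENCE = "`" * 3  # a markdown code fence, built this way to keep the snippet readable

def extract_sql(llm_reply):
    """Pull the first fenced SQL statement out of an LLM reply.
    Returns (sql, prose) where prose is the reply minus the SQL block."""
    m = re.search(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, llm_reply, re.DOTALL)
    if m is None:
        return None, llm_reply
    sql = m.group(1).strip()
    prose = (llm_reply[:m.start()] + llm_reply[m.end():]).strip()
    return sql, prose

# Simulated LLM reply; in production this comes from the chat completion
reply = (f"Here are the requested records:\n{FENCE}sql\n"
         f"SELECT name FROM attendance WHERE date = '2024-01-02';\n{FENCE}")
sql, prose = extract_sql(reply)

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL datastore
conn.execute("CREATE TABLE attendance (name TEXT, date TEXT)")
conn.executemany("INSERT INTO attendance VALUES (?, ?)",
                 [("Alice", "2024-01-02"), ("Bob", "2024-01-03")])
rows = conn.execute(sql).fetchall()

print(prose)  # the LLM's text answer, SQL removed
print(rows)   # the recordset returned by the query
```

In a real deployment you would validate the generated SQL (read-only user, allow-listed tables) before executing it.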
In short, you have a perfect blend of LLM, procedural code (your script), and set theory (SQL, etc.), with each technology doing what it's best at!
And in turn... you have a highly reliable piece of code that runs much less expensively, and far more accurately, than any other approach. Win-win.
I provided the example developer-code backend for AI function calling, so the AI can write a function request for the information, then answer the user about it.
The API parameter 'functions' can be matched to one of the three functions shown, to access particular aspects of the data as pondered in the original post (while not giving the AI carte blanche to write SQL queries).
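As a sketch, hypothetical schemas for that 'functions' parameter, mirroring the three pandas helpers above (the names and parameter shapes are illustrative, not an official spec), might look like:

```python
# Hypothetical "functions" parameter matching the three pandas helpers;
# each entry gives the model a name, a description, and a JSON-schema
# description of the arguments it may supply.
functions = [
    {
        "name": "list_names",
        "description": "List unique employee names, optionally filtered by prefix.",
        "parameters": {
            "type": "object",
            "properties": {
                "search": {"type": "string",
                           "description": "Name prefix, or 'all' for every name."},
            },
        },
    },
    {
        "name": "search_time_records",
        "description": "Return attendance rows for an employee, optionally "
                       "restricted to a date range and a time range.",
        "parameters": {
            "type": "object",
            "properties": {
                "employee name": {"type": "string"},
                "date range": {"type": "array", "items": {"type": "string"}},
                "time range": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["employee name"],
        },
    },
    {
        "name": "employee_attendance_count",
        "description": "Count attendance rows matching the same query shape.",
        "parameters": {
            "type": "object",
            "properties": {
                "employee name": {"type": "string"},
                "date range": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["employee name"],
        },
    },
]
print([f["name"] for f in functions])
```

The model then emits a function name plus a JSON argument object, which your code dispatches to the corresponding pandas helper.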
There are trade-offs to an ORB approach like you've suggested. I agree that if there were some security issue (in this case there is not; it's an isolated container for a CSV), abstracting to an ORB would be ideal.
In this instance, abstracting to an ORB ties the hands of the LLM, whereas allowing the full power of a structured query language buys you more powerful/flexible querying and capability, assuming the developer desires this.
Also, there may have been some confusion. Your solution is great, I 'thanked' it, and I appreciate all of your posts. I was not criticizing, in any way; I was explaining the difference in approach.
It's really a cool way to use the LLM and your existing data without having to embed it.
A few quick examples I cut and pasted from my log:
please list the last 20 log type = sql entries. Just display the descriptions.
what was the last model used for log type 'sql'?
what was the last model used for sql queries?
show me 10 queries from users other than Reggie or somebody. Only show the user, question, notes and date.
when was the first query I posted?
what have my queries been about?
how many queries have I made?
what has been the average cost of my queries?
show me the total time elapsed for all of my queries?
show me the dates and descriptions and cost and elapsed time and models used for the last 5 questions I asked.
@SomebodySysop : Awesome post and recommendations. Thank you!
I have built myself an app where my customers can interact with their data (datasets from CSV files attached to the assistants), and I am running into serious latency and performance issues (Assistants API, Code Interpreter, gpt-4-1106-preview). So I want to investigate an alternative approach that would generate some SQL code based on the user input, with the SQL being executed against a MySQL database that already contains the customer's data.
Where I am struggling is how to format the data returned by the SQL code so it feels like the AI is responding to the user input in plain English. I am thinking about using a separate assistant for that specific purpose with two user inputs:
1. The initial request from the user
2. The data returned by the SQL query that was executed against the database
with the instructions being to format these data in plain English so it feels like a natural response to the initial request made by the user.
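That two-input formatting step could be sketched like this. Everything here is a hypothetical shape, not a fixed API: the helper name, the prompt wording, and the sample rows are assumptions, and the actual model call is omitted.

```python
def build_formatting_messages(user_request, sql_rows):
    """Build a chat-style message list asking the model to phrase
    a SQL result as a natural answer to the user's request."""
    rows_text = "\n".join(", ".join(str(v) for v in row) for row in sql_rows)
    return [
        {"role": "system",
         "content": "Rewrite the query result below as a natural-language "
                    "answer to the user's request. Do not mention SQL."},
        {"role": "user",
         "content": f"Request: {user_request}\nQuery result:\n{rows_text}"},
    ]

# Hypothetical usage: the request plus the rows the database returned
msgs = build_formatting_messages(
    "How many days was Alice present in January?",
    [("Alice", 12)],
)
print(msgs[1]["content"])
```

These messages would then be sent to the model (or a dedicated assistant), and its reply shown to the user as the answer.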
You really want to define your table(s) as well as you can to the LLM so it can understand how to translate the user's request into the best SQL statement possible.
Right now, I return an array that I format as an HTML table for display. It sounds like you would want to take the response and send it back to the LLM to generate a plain-English response. You can have it summarize the entire output, or iterate through it, returning natural-language explanations of each row.
You should be able to do this easily with the Assistants API. Not sure why you'd want to use the Code Interpreter specifically for this; even gpt-3.5-turbo-0125 is able to handle something like this.
I am using Code Interpreter with my current implementation, but if I switch to a SQL-based approach, then it won't be required. Better user experience with hopefully less latency, as well as lower operating costs. Definitely worth exploring!
Stick that data into MySQL and run queries on it. Why fiddle around 'querying CSV data'? That doesn't even make sense, and it's not what CSV is designed for at all...
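For the import step suggested here, pandas can push a CSV straight into a database table. A minimal sketch, using an in-memory SQLite database in place of MySQL (with MySQL you would pass a SQLAlchemy engine instead of the sqlite3 connection) and a tiny inline CSV as hypothetical data:

```python
import io
import sqlite3

import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_data = "Name,date\nAlice,2024-01-02\nBob,2024-01-03\n"
df = pd.read_csv(io.StringIO(csv_data))

# to_sql creates the table and inserts every row of the frame
conn = sqlite3.connect(":memory:")  # swap for a SQLAlchemy MySQL engine in production
df.to_sql("attendance", conn, index=False)

count = conn.execute("SELECT COUNT(*) FROM attendance").fetchone()[0]
print(count)  # → 2
```

From there, every question becomes a plain SQL query rather than a retrieval problem.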
So now, I tell the LLM to generate a report from my database using SQL. Then, I send the resulting report back to the LLM to analyze and render a response based upon that analysis. It works really, really well, at least with gpt-4-turbo-preview.
FWIW, this link is to an OpenAI cookbook showing how to generate the SQL query using the chat completions API in function-argument-generation mode.
I've used it, with the code ported to C# (.NET 7, VS2022), to generate queries against an SQLite database containing tables downloaded from the IMDb public dataset. It does a pretty decent job. My program then queries the DB and displays the result. This implements the idea posted by DevGirl above...
Here is an example of a solution where you can upload your CSV (just replace your employees' private data with IDs and keep the text description data): www . table-search . com/share-my-list.html
and start asking questions in a human-readable format.
Under the hood of the service are OpenAI Embeddings and a vector DB, similar to what you described.