Here’s challenge that I think represents some challenges that people struggle with in dealing with ‘user data’. The Fortune 500 list is about as simple as it gets, see attached. 500 records and a header.34kb as a CSV.
I’m curious how much success people have with the different models to get it to correct answer some questions:
What is the state with the most Fortune 500 companies?
How many Fortune 500 companies are in <>
Give me a list of the Fortune 500 companies in Georgia, sorted by number employees.
Is NCR on the list - and if so what is it’s rank and State.
I have found it impossible to consistently have it answer these correct use the Assistant strucutre with CSV or JSON version of the file added to the assistant.
Direct inference might perform poorly, but you can just ask it to use python.
Resubmitted because had to expand analysis for code.
import pandas as pd
# Load the CSV file
file_path = '/mnt/data/Fortune 500 2023.csv'
fortune_500_data = pd.read_csv(file_path)
# Display the first few rows of the dataframe to understand its structure
fortune_500_data.head()
# Counting how many Fortune 500 companies are based in California
california_companies_count = fortune_500_data[fortune_500_data['State'] == 'California'].shape[0]
california_companies_count
Yes I’m sure it can be do it that way. Just like a pivot table in Excel works wonders. Not sure if I’m willing/able to offer that to my corporate users for vairous reasons.
But cool!
Can I ask why you don’t think you can? Are you worried about making the infrastructure to run and test code?
If you are, I have a pretty simple system where I have a llm class object for my projects. The class has a “run_code()” function on it. Basically on response from the prompt, the LLM creates some code, the code gets saved to disk, then loaded as a functional skill:
Old code, found a better way of doing it.
def get_functional_skill(skill: object):
"""Gets the template content of the functional skill."""
path = save_functional_skill(skill)
spec = importlib.util.spec_from_file_location(skill["skill_name"], path)
func_skill = importlib.util.module_from_spec(spec)
spec.loader.exec_module(func_skill)
return func_skill
But then you just call it with func_skill.do_the_thing().
I use a similar system to run my custom functions! But at least there I know what code will be running. Just look at the Langchain securty bulletins like this one: https://ntietz.com/blog/langchain-rce/
I think I might be most annoyed that running it in plain ChatGPT 4 it seems to answer these questions correct - but in my Assistant structure with the file uploaded to the assistant I am not able reproduce those results
You have probably already figured this out, or, moved on by now. But, for me, I find that if I provide the files in the initial chat and tell it to read and learn from every word of each of those files, and not to quit reading those files until it has done so, and understands all of it, it works about 95% of the time. I also usually include some kind of out line for the steps I want it to take to reach the users result in a system message, which includes reading those files in the first step before doing anything else. You should also always tell it to tell you if it is having trouble accessing any files you tell it to use. And tell it to try again if it ever does so. This always fixes the other 5% for me. Either way though, in my opinion, the assistants can be pricey when you have a ton of data, are using other api’s, like dall-e-3, etc. I’d just build a GPT, that way you could just upload your files, press enter, and get your pre configured questions answered from your data, and it cost you nothing extra. System messages are not always listened to by assistants, or any gpt for that matter. But, the actual chats are highly succeptable to data success in the form of files, and user input. Just something to think about. And, GPT’s can take in 300 pages of data in one chat input, and output 25 or so. Plenty for most data questions. Hope this helps, and I’m sorry it’s so long, lol.