CSV file: how best to parse larger data?

I have a problem and hope someone can help.
I have a file of 10,000 rows with 2 columns: the name of a company and the number of employees. I want ChatGPT to categorize them into Small, Medium, and so on.

If I use the interface and GPT-3.5 with one sample row, it works great.
If I copy and paste 50 rows and tell it to make a table, it also works great.

But I can't copy and paste more than, let's say, 100 rows. To work in practice, I need automation.

So I wrote a script and tried it via the API.
If I send all the data via the API at once, I get an error. Does anyone know how much data I can upload in a JSON payload?

OK, so my next try was to go through it line by line. That worked (a simplified sketch is below), but it had two disadvantages:

  1. It uses a lot of tokens and runs into rate limits quickly.
  2. The quality is not nearly as good as with the larger copy-and-paste style.
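
Roughly, the line-by-line version looks like this (a simplified sketch, not my exact script; the model name, file name, and prompt wording are just placeholders):

```python
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

results = []
with open("companies.csv", newline="", encoding="utf-8") as f:
    for name, employees in csv.reader(f):
        # One request per row: it works, but it burns tokens and hits rate limits quickly.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"Classify the company '{name}' with {employees} employees "
                           "as Small, Medium or Large. Answer with one word.",
            }],
        )
        results.append((name, employees, response.choices[0].message.content.strip()))
```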

So can anyone help who has done the same thing?
And if not, could you maybe answer these questions?

  1. What is the size limit for a JSON upload?
  2. What parameters does ChatGPT 3.5 use in the interface?
  3. How can I make sure the API REALLY gives back one word as output and not whole sentences?

For the one word, you can ask for a format like this:

Display the word like this:

[onlyonewordhere]<<

No conclusion, no description, just answer with one word in the given format.
If it is not possible, answer with

NO
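
A minimal sketch of how you could wire up and verify that format in a script (the prompt wording and the regex are assumptions; adapt them to your data):

```python
import re

def build_prompt(company: str, employees: int) -> str:
    # Instruction format as described above; the wording is only an example.
    return (
        f"Company: {company}, employees: {employees}. "
        "Classify the company size as one word.\n\n"
        "Display the word like this:\n\n"
        "[onlyonewordhere]<<\n\n"
        "No conclusion, no description, just answer with one word in the given format. "
        "If it is not possible, answer with NO."
    )

def extract_word(answer: str) -> str | None:
    # Accept only a single word (optionally in brackets) followed by "<<"; otherwise reject.
    match = re.fullmatch(r"\s*\[?([A-Za-z]+)\]?\s*<<\s*", answer)
    return match.group(1) if match else None
```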

Hi @t.olscha

Welcome to the community.

For your use case, I'd recommend reading the docs for embeddings.

EDIT: On re-reading your post, it seems that you need to come up with your criteria for SML, and then you can simply do it in a spreadsheet - as simple as writing a formula.


If all that you need is to classify a company into a few categories like 'Small', 'Medium', and 'Large', based on the number of employees, maybe you can ask ChatGPT to write a logical program to do the classification.
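
For example, such a program could be as small as the sketch below (the employee thresholds and file names are assumptions; pick boundaries that match your own definition of S/M/L):

```python
import csv

def classify(employees: int) -> str:
    # Example thresholds only; adjust to your own criteria.
    if employees < 50:
        return "Small"
    if employees < 250:
        return "Medium"
    return "Large"

with open("companies.csv", newline="", encoding="utf-8") as src, \
     open("companies_classified.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for name, employees in csv.reader(src):
        writer.writerow([name, employees, classify(int(employees))])
```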


I have done many similar projects with great success and at low cost.

As for the JSON payload limit: it's big; nearly infinite. You just need to know how to manage data and AI at scale.

HOWEVER


Before I say more, you need to lay out your goal in more depth. I understand the following:

  1. ~10,000 rows by 2 columns; name and number of employees
  2. Desired classification; SML

You've said nothing about:

  • Is this a one-time classification process?
  • The nature of the classification boundaries; are they fixed? Dynamic? Statistically generated?
  • The pace of change; how often does the list change?
  • The possibility that over time classification boundaries will change.
  • Is there a financial budget for this goal? If so, at least describe it.
  • Who will use this data? Humans? Machines? Both? How will they use it and to do what?

Give us a better sense of your business objectives so we can advise you well.


Hey Jochen, thanks for your answer. Where exactly would you place the [onlyonewordhere]?
Like this?
In the format: number of employees, company size[onlyonewordhere]?

When I read your response, I also thought embeddings are not the right thing. And yes, everything works in general, but not in the concrete case.

Yes, but indeed I want to do larger things. And I want to understand which process is best for larger files: going through it step by step, line by line, or all at once.
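
For what it's worth, the "all at once" variant could be approximated by sending fixed-size batches, roughly like this (batch size, model, and prompt are assumptions, not a tested solution):

```python
import csv
from openai import OpenAI

client = OpenAI()
BATCH_SIZE = 50  # assumption: roughly the size that worked well via copy and paste

with open("companies.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    table = "\n".join(f"{name};{employees}" for name, employees in batch)
    # One request per batch keeps the request count (and rate-limit pressure) low.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Classify each company as Small, Medium or Large. "
                       "Return one line per company in the form name;employees;size.\n\n" + table,
        }],
    )
    print(response.choices[0].message.content)
```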

No, I would format that differently:

Analyze this:

[[text start

(the text goes here)

text end]]

Give me:

{
'mitarbeiterzahl': [0-9],
'firmenname': '',
'

}

For the value constraints, maybe also take a look at schema.org (i.e., not just regular expressions).

And for large texts, extract a lot of this kind of metadata and then summarize (possibly also fill a graph DB).
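
A minimal sketch of how such a response could be consumed, assuming the model really does return a JSON object with exactly these keys (no retries or validation shown):

```python
import json

def parse_company_record(model_output: str) -> dict:
    # Expects something like {"mitarbeiterzahl": 120, "firmenname": "ACME GmbH"}.
    data = json.loads(model_output)
    return {
        "employees": int(data["mitarbeiterzahl"]),
        "company": data["firmenname"],
    }
```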

@bill.french I wanted to ask about the possibility of implementing statistics: can I feed the model a large dataset and then query any required statistic, as in the example below?

For example, my data records which app a user visited, from which location, and at what timestamp; this dataset will be huge.


Now I should be able to query the average count of visitors within some time window, or the count of visitors who visited ST1 yesterday (consider STName to be an application name), and so on.


How should this be approached with OpenAI?

Stating what you "should" be able to do is a hypothesis that must also factor in practical boundaries.

Time-series data is typically raw and voluminous. Intentionally, IoT signals are collected to ensure real-time perturbations can be detected and corrected. This is especially important for mission-critical processes where a few missed events sometimes indicate a big problem. But time-series data is also valuable for machine learning. Your use case is neither - you are looking for the lazy pathway to analytics. And I'm fine with that - I love a good lazy approach - it's how great innovations are made. :wink:

IN THIS CASE, the AI "fit" is a reach (based on my skill set, known approaches, and financial practicalities). Practically speaking, I assess that this is a round hole and giant earth mover problem. Putting a mega earth mover in a small hole has one challenge - physics. AI interfaces (UIs and APIs) are presently limited; they're tiny holes. There are indications we will soon see 100k prompt capabilities. But that's nothing compared to extremely granular time-series data - at least the volume that would produce valid assessments.

You can't take a slice of the series and expect your analytics to be valid; the entire point of analytics is to factor in lots of data. As such, the only rational pathway I can see is to aggregate and then expose the aggregated data to the AI model in the form of discrete learner prompts.
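
As a hedged sketch of that "aggregate first, then prompt" idea, assuming the raw events live in a CSV with STName, location, and timestamp columns (all names here are assumptions):

```python
import pandas as pd

# Load the raw time-series events; column names are assumptions.
visits = pd.read_csv("visits.csv", parse_dates=["timestamp"])

# Collapse the event stream into a daily visitor count per application
# before the model ever sees it.
daily = (
    visits.groupby(["STName", visits["timestamp"].dt.date])
    .size()
    .reset_index(name="visitors")
)

# A compact summary like this fits into a prompt; the raw stream never would.
print(daily.to_csv(index=False))
```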

Aggregation approaches take many forms and I believe there are some clever things that can be done to support an increasingly adept AI process.

This is to say that all the crap we learned about data, aggregations, summations, and analytics - before LLMs became popular - still matters.


Thanks for the guidance. I was fairly sure but also confused, being new to this field, about how I can solve this problem with a better approach.