Company classifier thoughts?

parakeet · September 22, 2023, 9:20pm

Can anyone share advice/experiences/thoughts on using OpenAI models via API to classify 2,500 company names by type/industry/focus? (Specifically, cycling through 2,500 WordPress terms in the “company” custom taxonomy).

I think the models benefit not only from web knowledge but from linguistic inference…

But I have previously struggled to get them to stick to a consistent organising list of labels.

Also, how well would this stack up against a dedicated company classifier that may be out there? NAICS/SIC code stuff is too broad, but someone may have developed a good dedicated classifier? (Not that I can find one right now).

elmstedt · September 23, 2023, 12:18am

Just so I’m clear,

You have a list of 2,500 company names.
You want to tag each of these entities with a label identifying what the company does.
You think the NAICS and SIC codes are not granular enough.

Is that correct?

There are over 2,100 NAICS codes, so unless nearly all of your companies fall within very few of the NAICS classifications, I should think that would be a good place to start.

But if, for instance, you have 2,500 software publishers, they would all fall under code 513210, so that wouldn’t be helpful.

But… if you visit the 513210 NAICS page at https://www.census.gov, you’ll see they list several index entries for that code:

Applications development and publishing, except on a custom basis
Applications software, computer, packaged
Computer software publishers, packaged
Computer software publishing and reproduction
Games, computer software, publishing
Gaming site publishers
Mobile applications development and publishing, except on a custom basis
Operating systems software, computer, packaged
Packaged computer software publishers
Programming language and compiler software publishers, packaged
Publishers, packaged computer software
Software computer, packaged, publishers
Software publishers
Software publishers, packaged
Utility software, computer, packaged

Which could give you additional granularity and keep the classifications consistent.
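To make that concrete, here is a minimal sketch of using a handful of those index entries as a closed label set. It only builds the prompt text and checks the reply; the actual API call is left out. The function names `build_prompt` and `parse_reply` are made up for illustration, and asking for a number instead of the label text is just one way to stop the model from paraphrasing.

```python
# A subset of the 513210 index entries listed above, used as a closed label set.
ALLOWED_LABELS = [
    "Games, computer software, publishing",
    "Gaming site publishers",
    "Operating systems software, computer, packaged",
    "Utility software, computer, packaged",
]


def build_prompt(company, labels=ALLOWED_LABELS):
    """Build a prompt that asks for a label number rather than free text."""
    numbered = "\n".join(f"{i}. {label}" for i, label in enumerate(labels, 1))
    return (
        "Classify the company below into exactly one of the numbered labels.\n"
        "Reply with the number only, or 0 if none fits.\n\n"
        f"Labels:\n{numbered}\n\nCompany: {company}"
    )


def parse_reply(reply, labels=ALLOWED_LABELS):
    """Map the model's reply back to a label; None if off-list or unparsable."""
    text = reply.strip().rstrip(".")
    if not text.isdecimal():
        return None
    index = int(text)
    return labels[index - 1] if 1 <= index <= len(labels) else None
```

Anything that comes back as `None` can be retried or set aside for manual review, so nothing off-list ever reaches the taxonomy.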

But, if you have 2,500 “Games, computer software, publishing” companies, and want to dive deeper, you’ll need to come up with your own, bespoke, fine-grained categorization system.

As always, the more information you can provide the better quality help we can give.

parakeet · September 23, 2023, 8:02am

Yes, you’ve got the gist.
When I did this with an NAICS classifier via API a few years ago, some of the results were too high-level. Maybe that was because I didn’t dig down through the hierarchies you mentioned, or maybe the sub-categories weren’t available; I don’t know.

Okay, so maybe I rediscover the API I used for NAICS company look-up, and maybe I get it to behave more granularly…

But how could OpenAI come in… ?

I think I could ask it to classify using NAICS, and I seem to recall having some success (or hallucinations) with this…

But, as of today, ChatGPT, for instance, says it needs detailed knowledge about the companies and reports “Not enough information available” for the majority of a test subset. Note that the latest NAICS revision is 2022, a year after the GPT models’ training data ends.

In Playground, GPT-3.5 and GPT-4 try harder and do return results, but they look like only higher-level categories despite my prompting to use NAICS children.

So I’m wondering about the worth of just trusting the GPT with the job itself…

I think it benefits from web knowledge of companies (until 2021) and, where it doesn’t, might infer certain company activities from the name.

However, getting it to stick to a consistent dictionary of terms seems like a challenge… in previous tests, it went through the full list and tagged many companies with variants of “Advertising” - I mean variants even of the same intrinsic term.
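One way to collapse those variants after the fact, assuming even a short canonical list can be agreed on, is a fuzzy match from the standard library. This is only a sketch: the three labels in `CANONICAL` are placeholders, `normalize_label` is a made-up name, and the 0.6 cutoff is simply the `difflib` default, not a tuned value.

```python
import difflib

# Placeholder canonical vocabulary; a real list would be longer and hand-checked.
CANONICAL = ["Advertising", "Software", "Publishing"]


def normalize_label(raw, canonical=CANONICAL, cutoff=0.6):
    """Snap a free-form label to the closest canonical term, or None if nothing is close."""
    lookup = {term.lower(): term for term in canonical}
    cleaned = raw.strip().lower()
    if cleaned in lookup:
        return lookup[cleaned]
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None
```

With this, "Advertising agency" and "Advertisement" both land on "Advertising", and a label with no close canonical term comes back as `None` for review. String similarity will not catch true synonyms (for example "Marketing"), so those would still need an explicit mapping.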

It’s a Catch-22, because I don’t have such a dictionary to feed it. I think feeding it the latest NAICS may be a non-starter due to size.
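On the size problem, one option might be to never send the whole list: ask for a two-digit sector first, then show only that sector's children in a second prompt. A rough sketch of the bookkeeping follows, with no API call included. The `NAICS` dict is a tiny hand-picked excerpt for illustration and should be replaced by the full official table; the helper names are hypothetical, and combined sectors such as 31-33 would need extra handling.

```python
# Tiny illustrative excerpt of NAICS 2022; load the full official table in practice.
NAICS = {
    "51": "Information",
    "54": "Professional, Scientific, and Technical Services",
    "513210": "Software Publishers",
    "541511": "Custom Computer Programming Services",
    "541810": "Advertising Agencies",
}


def sectors(table=NAICS):
    """Stage 1 options: the two-digit sector codes only."""
    return {code: title for code, title in table.items() if len(code) == 2}


def children(sector, table=NAICS):
    """Stage 2 options: six-digit codes that fall under the chosen sector."""
    return {
        code: title
        for code, title in table.items()
        if len(code) == 6 and code.startswith(sector)
    }


def menu(options):
    """Render options as 'code: title' lines to paste into a prompt."""
    return "\n".join(f"{code}: {title}" for code, title in sorted(options.items()))


def validate(reply, options):
    """Accept the model's reply only if it is one of the offered codes."""
    code = reply.strip()
    return code if code in options else None
```

Each prompt then carries a short menu rather than the entire code list, and `validate` rejects any code that was not offered at that stage.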
