Company classifier thoughts?

Can anyone share advice/experiences/thoughts on using OpenAI models via API to classify 2,500 company names by type/industry/focus? (Specifically, cycling through 2,500 WordPress terms in the “company” custom taxonomy).

I think the models benefit not only from web knowledge but from linguistic inference…

But I have previously struggled to get them to stick to a consistent organising list of labels.

Also, how well would this stack up against a dedicated company classifier that may be out there? NAICS/SIC code stuff is too broad, but someone may have developed a good dedicated classifier? (Not that I can find one right now).

Just so I’m clear,

  1. You have a list of 2,500 company names.
  2. You want to tag each of these entities with a label identifying what the company does.
  3. You think the NAICS and SIC codes are not granular enough.

Is that correct?

There are over 2,100 NAICS codes. Unless nearly all of your companies fall within very few of the NAICS classifications, I should think that would be a good place to start.

But if, for instance, you have 2,500 software publishers, they would all fall under code 513210, so that wouldn’t be helpful.

But… if you visit the 513210 NAICS page at https://www.census.gov, you’ll see they list several index entries for that code:

  • Applications development and publishing, except on a custom basis
  • Applications software, computer, packaged
  • Computer software publishers, packaged
  • Computer software publishing and reproduction
  • Games, computer software, publishing
  • Gaming site publishers
  • Mobile applications development and publishing, except on a custom basis
  • Operating systems software, computer, packaged
  • Packaged computer software publishers
  • Programming language and compiler software publishers, packaged
  • Publishers, packaged computer software
  • Software computer, packaged, publishers
  • Software publishers
  • Software publishers, packaged
  • Utility software, computer, packaged

Which could give you additional granularity and keep the classifications consistent.
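As a rough illustration of how index entries like these could serve as a closed label set, here is a minimal Python sketch. It assumes a hypothetical `ask_model` callable that takes a prompt string and returns the model's reply as text; the function names and the prompt wording are made up for the example and are not part of any API. Asking for a number rather than a label means any reply outside the list can be rejected outright.

```python
from typing import Callable, Optional

# A few of the index entries listed for NAICS 513210, used as a closed label set.
LABELS = [
    "Games, computer software, publishing",
    "Gaming site publishers",
    "Operating systems software, computer, packaged",
    "Utility software, computer, packaged",
    "Software publishers",
]


def build_prompt(company: str, labels: list[str]) -> str:
    """Present the labels as a numbered menu and ask for a number only."""
    menu = "\n".join(f"{i}. {label}" for i, label in enumerate(labels, start=1))
    return (
        f"Classify the company '{company}' into exactly one of these categories.\n"
        f"{menu}\n"
        "Answer with the number only, or 0 if none fits."
    )


def parse_choice(reply: str, labels: list[str]) -> Optional[str]:
    """Map a numeric reply back to a label; anything else becomes None."""
    text = reply.strip().rstrip(".")
    if not text.isdecimal():
        return None
    index = int(text)
    if 1 <= index <= len(labels):
        return labels[index - 1]
    return None


def classify(company: str, labels: list[str],
             ask_model: Callable[[str], str]) -> Optional[str]:
    # ask_model is a hypothetical stand-in for whatever API call is used.
    return parse_choice(ask_model(build_prompt(company, labels)), labels)
```

Companies that come back as `None` can be retried or set aside for manual review, so an off-list label never reaches the taxonomy.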

But if you have 2,500 “Games, computer software, publishing” companies and want to dive deeper, you’ll need to come up with your own bespoke, fine-grained categorization system.

As always, the more information you can provide, the better the help we can give.

Yes, you’ve got the gist.
When I did this with an NAICS classifier via API a few years ago, some of the results were too high-level. Maybe that was because I didn’t dig down through the hierarchies you mentioned, or maybe the sub-categories weren’t available; I don’t know.

Okay, so maybe I rediscover the API I used for NAICS company look-up, and maybe I get it to behave more granularly…

But how could OpenAI come in… ?

  1. I think I could ask it to classify using NAICS, and I seem to recall having some success (or hallucinations) with this…

But, as of today, ChatGPT, for instance, says it needs detailed knowledge about the companies and reports “Not enough information available” for the majority of a test subset. Note that the latest NAICS revision dates from 2022, a year after the GPT models’ training data ends.

In Playground, GPT-3.5 and GPT-4 try harder, returning results - but they look like only higher-level categories despite my prompting to use NAICS children.

  2. So I’m wondering about the worth of just trusting the GPT with the job itself…

I think it benefits from web knowledge of companies (until 2021) and, where it doesn’t, might infer certain company activities from the name.

However, getting it to stick to a consistent dictionary of terms seems like a challenge… in previous tests, it went through the full list and tagged many companies with variants of “Advertising” - I mean variants even of the same intrinsic term.
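One possible way to tame such variants after the fact is to collapse the free-form labels from a first pass into a small vocabulary, which could then be fed back as a fixed list. The Python sketch below is only an illustration: it uses fuzzy string matching from the standard library's `difflib`, which is a technique swapped in for the example and not something tested in this thread, and the `cutoff` value and sample labels are assumptions that would need tuning on real output.

```python
from collections import Counter
import difflib


def build_vocabulary(raw_labels: list[str], cutoff: float = 0.6) -> dict[str, str]:
    """Map each raw label to a canonical form, preferring the most frequent spelling."""
    counts = Counter(label.strip().lower() for label in raw_labels)
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    # Visit labels from most to least frequent so common forms become canonical.
    for label, _ in counts.most_common():
        match = difflib.get_close_matches(label, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[label] = match[0]   # fold the variant into an existing term
        else:
            canonical.append(label)     # first sighting of a new term
            mapping[label] = label
    return mapping


# Hypothetical first-pass output containing several "Advertising" variants.
raw = ["Advertising", "Advertising", "advertising agency",
       "Advertising & Marketing", "Retail"]
vocabulary = build_vocabulary(raw)
print(sorted(set(vocabulary.values())))
```

Character-level similarity only catches near-spellings, so terms that mean the same thing but look different (say, “Ad tech” and “Advertising”) would still need a manual pass.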

It’s a Catch-22, because I don’t have such a dictionary to feed it. I think feeding it the latest NAICS may be a non-starter due to size.
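On the size problem, one option might be to walk the hierarchy in two steps, so the model only ever sees a short menu: first pick a sector, then pick an industry within it. The Python sketch below assumes a hypothetical `ask_model` callable (prompt in, reply text out) and uses a tiny hand-typed excerpt of NAICS codes for illustration; the excerpt should be checked against the official list before any real use, and the function names are invented for the example.

```python
from typing import Callable, Optional

# Tiny illustrative excerpt; a real run would load the full NAICS table.
HIERARCHY = {
    "51 Information": [
        "513210 Software Publishers",
        "512110 Motion Picture and Video Production",
    ],
    "54 Professional, Scientific, and Technical Services": [
        "541810 Advertising Agencies",
        "541511 Custom Computer Programming Services",
    ],
}


def pick(question: str, options: list[str],
         ask_model: Callable[[str], str]) -> Optional[str]:
    """Show a numbered menu and accept only a valid number as the answer."""
    menu = "\n".join(f"{i}. {option}" for i, option in enumerate(options, start=1))
    reply = ask_model(f"{question}\n{menu}\nAnswer with the number only.").strip()
    if reply.isdecimal() and 1 <= int(reply) <= len(options):
        return options[int(reply) - 1]
    return None


def classify_two_stage(company: str, hierarchy: dict[str, list[str]],
                       ask_model: Callable[[str], str]) -> Optional[str]:
    # Stage 1: choose among the top-level sectors only.
    sector = pick(f"Which sector best fits the company '{company}'?",
                  list(hierarchy), ask_model)
    if sector is None:
        return None
    # Stage 2: choose among that sector's children only.
    return pick(f"Within '{sector}', which industry best fits '{company}'?",
                hierarchy[sector], ask_model)
```

This costs two calls per company instead of one, but each prompt stays small regardless of how large the full code list is.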