Extreme Multilabel classification

How can we solve extreme multilabel classification using OpenAI models or other LLMs?
I have more than 50k labels. How can we deal with that?

There is no way to even fit that many words into an AI's context length in a form it could comprehend.

50,000 “labels” is pretty close to the size of a dictionary that covers most of a language, so the labels essentially become meaningless.

Suppose you had tons of very specialized labels that could not be extracted by algorithmic word searches (say “18th-century French poetry”, “Recipes better with celery salt”, “Boris Yeltsin administration”, or what have you). The only way I see is to pass perhaps 1000 labels at a time, over 50 AI classifications or extractions of the same text, and tell the AI not to assign any label unless the text is an exact semantic match for it.
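A minimal sketch of that batching idea, in Python. The names `classify_in_batches` and `ask_model` are hypothetical: `ask_model` stands for whatever LLM call you use, and `keyword_stub` is only a toy stand-in so the example runs without an API key.

```python
from typing import Callable, Iterable, List, Set


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_batches(
    text: str,
    labels: List[str],
    ask_model: Callable[[str, List[str]], List[str]],
    batch_size: int = 1000,
) -> Set[str]:
    """Run one classification call per batch of labels and merge the matches."""
    matched: Set[str] = set()
    for batch in chunked(labels, batch_size):
        allowed = set(batch)
        # Discard anything the model returns that was not in this batch,
        # so invented or paraphrased labels never reach the result.
        matched.update(label for label in ask_model(text, batch) if label in allowed)
    return matched


def keyword_stub(text: str, batch: List[str]) -> List[str]:
    """Toy stand-in for the LLM call (assumption): exact whole-word matches only."""
    tokens = set(text.lower().split())
    return [label for label in batch if label.lower() in tokens]
```

With 50,000 labels and the default batch size this makes 50 calls per text, so cost and latency scale with the number of batches.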

@_j has a point about the large number of labels. What are you trying to classify, and into what?

You may be able to use embeddings (e.g. ada-002) to predict the labels. Mileage will vary with the uniqueness of the corpus and how representative your training data is.

However, I agree with @_j that with 50,000 labels it would be meaningless. What’s the use case?

Do you have text descriptions associated with your labels? If so, embedding those descriptions might be the start of an approach.
Have GPT ‘tl;dr’ (summarize) your input, then embed the summary and do a similarity search against the label-description embeddings.

Or some variant of that.
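A rough sketch of the similarity step, in Python. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `top_labels` is a hypothetical helper name. With a real model you would swap in its embedding call and precompute the label vectors once.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding call (assumption): word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_labels(
    text: str,
    label_descriptions: Dict[str, str],
    k: int = 3,
    embed: Callable[[str], Counter] = toy_embed,
) -> List[Tuple[str, float]]:
    """Return the k labels whose descriptions are most similar to `text`."""
    query = embed(text)
    # For clarity the descriptions are embedded on every call; with 50k labels
    # you would embed them once and cache or index the vectors.
    scored = [(label, cosine(query, embed(desc)))
              for label, desc in label_descriptions.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because this is multilabel, you would usually keep every label above some similarity threshold rather than a fixed `k`; the right threshold depends on the embedding model and your data.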


@bruce.dambrosio has a nice approach.

If you want to work on the full text, here’s a way:

  1. Categorise labels into high level groups.
  2. Classify the data to the matching high-level group.
  3. Classify the text again to find out the best label from the group in step 2.

Note: To categorise the labels into groups, embed them and cluster the embeddings.
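The three steps could be sketched like this in Python, assuming the labels have already been grouped (step 1). `two_stage_classify` is a hypothetical name, and `overlap_score` is a toy token-overlap scorer standing in for an embedding similarity or an LLM call.

```python
from typing import Callable, Dict, List, Tuple


def overlap_score(text: str, candidate: str) -> float:
    """Toy stand-in scorer (assumption): Jaccard overlap of lowercase tokens."""
    a = set(text.lower().split())
    b = set(candidate.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def two_stage_classify(
    text: str,
    groups: Dict[str, List[str]],
    score: Callable[[str, str], float] = overlap_score,
) -> Tuple[str, str]:
    """Pick the best high-level group, then the best label inside that group."""
    # Step 2: each group is represented by its name plus all of its labels.
    best_group = max(
        groups,
        key=lambda name: score(text, name + " " + " ".join(groups[name])),
    )
    # Step 3: only the labels of the winning group are compared in detail.
    best_label = max(groups[best_group], key=lambda label: score(text, label))
    return best_group, best_label
```

This returns a single label per text. For multilabel output you would keep the top few groups and the top few labels in each, at the cost of more comparisons.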


Ooo, I like that. That is nifty; added to my toolbox of nice ideas.

I have a CSV file with two columns: the text, and the multiple labels associated with that text.
The total number of unique labels is more than 50k.

When I send new text, the model should predict its labels from that set of 50k labels, the same way they are assigned in the CSV file.