Extreme Multilabel classification

vikas.s0302 · August 8, 2023, 11:25am

How can we solve extreme multilabel classification using OpenAI models or another LLM?
I have more than 50k labels. How can we deal with that?

_j · August 8, 2023, 11:34am

There is no way to even provide that many words to an AI within its context length in a form it could comprehend:

50000 “labels” are pretty close to the size of a dictionary that can convey most languages, so they essentially become meaningless.

If you had tons of very specialized labels that could not be extracted by algorithmic word searches (say “18th-century French poetry”, “Recipes better with celery salt”, “Boris Yeltsin administration”, or what have you), the only way I see is to pass perhaps 1000 terms at a time over 50 AI classifications or extractions of the same text, and let the AI know that no keywords are to be appended without an exact semantic content match.
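
A rough sketch of that batching loop, in case it helps. `ask_model` is a hypothetical stand-in for a single LLM call that receives the text plus one batch of allowed labels and returns the labels it thinks apply; it is not a real API. The batch size of 1000 is just the figure from above.

```python
from typing import Callable, Iterable, List, Set


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_batches(
    text: str,
    labels: List[str],
    ask_model: Callable[[str, List[str]], List[str]],  # hypothetical LLM wrapper
    batch_size: int = 1000,
) -> Set[str]:
    """Run one classification pass per batch of labels and merge the results."""
    found: Set[str] = set()
    for batch in chunked(labels, batch_size):
        allowed = set(batch)
        picked = ask_model(text, batch)
        # Drop anything the model returned that was not offered in this batch.
        found.update(label for label in picked if label in allowed)
    return found
```

With 50,000 labels this makes 50 calls per text, so cost and latency add up quickly.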

sps · August 8, 2023, 11:39am

@_j has a point with the large number of labels. What are you trying to classify into what?

lids · August 8, 2023, 5:07pm

You may be able to use embeddings, e.g. ada-002, to predict the labels. Mileage will vary with the uniqueness of the corpus and how representative your training data is.
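
One way this could look is a nearest-neighbour vote over already-labelled examples. This is only a sketch: it assumes the training texts and the new text have already been turned into embedding vectors (plain lists of floats here), and the names `predict_labels`, `k` and `min_votes` are made up for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_labels(
    query_vec: Sequence[float],
    train_vecs: List[Sequence[float]],
    train_labels: List[List[str]],
    k: int = 3,
    min_votes: int = 2,
) -> List[str]:
    """Return labels shared by at least `min_votes` of the k nearest examples."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    votes: Counter = Counter()
    for i in ranked[:k]:
        votes.update(set(train_labels[i]))  # each neighbour votes once per label
    return sorted(label for label, count in votes.items() if count >= min_votes)
```

The full sort is fine for a sketch; for a large training set a vector index would be the usual replacement.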

However, I agree with @_j that with 50,000 labels it would be meaningless. What’s the use case?

bruce.dambrosio · August 8, 2023, 6:36pm

Do you have text descriptions associated with your labels? If so, embedding those descriptions might be the start of an approach.
Have GPT ‘tl;dr’ your input, then embed that summary and do a similarity search against the label embeddings.

Or some variant of that.
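
A minimal sketch of the similarity step, assuming the summary and every label description have already been embedded (the vectors are plain lists of floats). `top_labels`, `k` and `threshold` are illustrative names, not part of any library.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_labels(
    summary_vec: Sequence[float],
    label_vecs: Dict[str, Sequence[float]],
    k: int = 5,
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return up to k (label, cosine similarity) pairs, best first."""
    query = normalize(summary_vec)
    scored = (
        (sum(a * b for a, b in zip(query, normalize(vec))), label)
        for label, vec in label_vecs.items()
    )
    best = heapq.nlargest(k, scored)
    # Keep only labels whose similarity clears the threshold.
    return [(label, score) for score, label in best if score >= threshold]
```

Since this is multilabel, the threshold matters more than `k`: it decides how many labels a text ends up with, and it would need tuning on held-out data.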

sps · August 8, 2023, 6:46pm

@bruce.dambrosio has a nice approach.

If you want to work on the full text, here’s a way:

1. Categorise the labels into high-level groups.
2. Classify the data into the matching high-level group.
3. Classify the text again to find the best label within the group from step 2.

Note: To categorise the labels, use embeddings to cluster them.
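
The three steps could be sketched roughly like this. It is a toy version under stated assumptions: label and text embeddings are plain lists of floats, the clustering is a bare-bones k-means seeded with the first k vectors, and both classification steps are swapped for nearest-vector lookups rather than LLM calls. All function names are made up for illustration.

```python
from typing import List, Sequence, Tuple


def dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Sequence[float]], k: int, iterations: int = 20):
    """Plain Lloyd's k-means, seeded with the first k vectors for determinism."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: dist2(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def two_stage_label(
    text_vec: Sequence[float],
    labels: List[str],
    label_vecs: List[Sequence[float]],
    centroids: List[List[float]],
    assignment: List[int],
) -> Tuple[int, str]:
    """Step 2: pick the nearest group. Step 3: pick the nearest label inside it."""
    group = min(range(len(centroids)), key=lambda c: dist2(text_vec, centroids[c]))
    candidates = [i for i, a in enumerate(assignment) if a == group]
    if not candidates:  # fall back to all labels if the group is empty
        candidates = list(range(len(labels)))
    best = min(candidates, key=lambda i: dist2(text_vec, label_vecs[i]))
    return group, labels[best]
```

Step 3 could just as well hand the candidate labels of the chosen group to the model in a prompt, since a single group is small enough to fit in context.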

Foxalabs · August 8, 2023, 7:03pm

Ooo, I like that, that is nifty, added to toolbox of nice ideas.

vikas.s0302 · August 9, 2023, 8:00am

I have a CSV file that contains two columns: Text, and the multiple labels associated with that text.
The total number of unique labels is more than 50k.

vikas.s0302 · August 9, 2023, 8:04am

If I send new text, the model should predict labels for it from the set of 50k labels, the same way the texts in the CSV file are labelled.
