How can we solve extreme multilabel classification using OpenAI/LLMs?
I have more than 50k labels. How can we deal with that?
There is no way to even provide that many words to an AI within the context length in a form it could comprehend.
50,000 “labels” is pretty close to the size of a dictionary that covers most of a language, so the labels essentially become meaningless.
If you had tons of very specialized labels that could not be extracted by algorithmic word searches (say “18th-century French poetry”, “Recipes better with celery salt”, “Boris Yeltsin administration”, or what have you), the only way I see is to pass perhaps 1000 labels at a time, running 50 AI classifications or extractions over the same text, and to tell the AI not to append any label unless the text is an exact semantic match for it.
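A minimal sketch of that batching idea, assuming a hypothetical `ask_model` callable that takes a prompt string and returns the model's reply as text (wire it to whatever chat API you use). The function names, prompt wording, and one-label-per-line reply format are all illustrative assumptions, not anything prescribed in this thread:

```python
from typing import Callable, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompt(text: str, candidates: List[str]) -> str:
    """Build one classification prompt (the wording is an assumption)."""
    return (
        "Return only the labels from the list that exactly match the meaning "
        "of the text, one per line. Return nothing if none apply.\n\n"
        "Labels:\n" + "\n".join(candidates) + "\n\nText:\n" + text
    )


def label_in_batches(
    text: str,
    labels: List[str],
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    batch_size: int = 1000,
) -> List[str]:
    """Run one model call per batch of labels and merge the accepted labels."""
    found: List[str] = []
    for batch in chunked(labels, batch_size):
        reply = ask_model(build_prompt(text, batch))
        allowed = set(batch)
        for line in reply.splitlines():
            candidate = line.strip()
            # Drop anything that is not verbatim one of this batch's labels.
            if candidate in allowed and candidate not in found:
                found.append(candidate)
    return found
```

Filtering the reply against the current batch guards against the model inventing labels or echoing ones it was not shown. With 50k labels and a batch size of 1000 this costs 50 calls per document.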
You may be able to use embeddings, e.g. ada-002 (text-embedding-ada-002), to predict the labels. Mileage will vary with the uniqueness of the corpus and how representative your training data is.
However, I agree with @_j that with 50,000 labels it would be meaningless. What’s the use case?
Do you have text descriptions associated with your labels? If so, embeddings using associated text descriptions might be a start of an approach.
Have GPT ‘tl;dr’ your input, then embed that summary and do a similarity search against the label embeddings.
Or some variant of that.
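A rough sketch of that approach, assuming you supply an `embed` callable (text in, vector out, for example a thin wrapper around an embeddings endpoint) and optionally a `summarize` callable for the tl;dr step. Both callables, the function names, and the `k` and `min_score` parameters are hypothetical:

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embed_labels(
    descriptions: Dict[str, str], embed: Callable[[str], Vector]
) -> Dict[str, Vector]:
    """Embed each label's text description once, up front."""
    return {label: embed(desc) for label, desc in descriptions.items()}


def top_labels(
    text: str,
    label_vectors: Dict[str, Vector],
    embed: Callable[[str], Vector],
    summarize: Optional[Callable[[str], str]] = None,
    k: int = 5,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return up to `k` (label, similarity) pairs, best match first."""
    query = summarize(text) if summarize else text  # optional tl;dr step
    query_vec = embed(query)
    scored = [(label, cosine(query_vec, vec)) for label, vec in label_vectors.items()]
    scored = [pair for pair in scored if pair[1] >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The label embeddings only need to be computed once and can be cached. The linear scan over 50k vectors is fine for a sketch, though a vector index would likely be worth it at scale.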
@bruce.dambrosio has a nice approach.
If you want to do it on the full text, here’s a way:
- Categorise labels into high level groups.
- Classify the data to the matching high-level group.
- Classify the text again to find the best label within the group chosen in step 2.
Note: To categorize the labels, use embeddings to cluster them.
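The steps above could be sketched roughly as follows. This version assumes the labels are already embedded and already clustered into groups (by k-means or any other clustering method). It also swaps nearest-centroid similarity in for both classification steps, where either step could instead be an LLM prompt. The names `groups`, `label_vectors`, and `top_groups` are hypothetical:

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def two_stage(
    query_vec: Vector,
    groups: Dict[str, List[str]],      # group name -> member labels (assumed given)
    label_vectors: Dict[str, Vector],  # label -> embedding (assumed given)
    top_groups: int = 1,
) -> List[Tuple[str, float]]:
    """Pick the closest group(s) first, then rank only the labels inside them."""
    centroids = {
        name: centroid([label_vectors[label] for label in members])
        for name, members in groups.items()
    }
    ranked = sorted(
        centroids, key=lambda name: cosine(query_vec, centroids[name]), reverse=True
    )
    candidates = [label for name in ranked[:top_groups] for label in groups[name]]
    scored = [(label, cosine(query_vec, label_vectors[label])) for label in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Raising `top_groups` above 1 trades extra comparisons for some protection against the first stage picking the wrong group, which matters for multilabel data where a text's labels can span groups.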
Ooo, I like that, that is nifty, added to toolbox of nice ideas.
I have a CSV file with two columns: Text, and the multiple labels associated with that text.
The total number of unique labels is more than 50k.
If I send new text, the model should predict labels from that set of 50k, the same way they are assigned in the CSV file.
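With labelled examples like that, the embedding suggestion earlier in the thread can be read as nearest-neighbour label voting: embed the training texts once, embed the new text, and collect the labels of the closest examples. A minimal sketch, in which the column names `Text` and `Labels`, the `;` label separator, and the `embed` callable are all assumptions about your setup:

```python
import csv
import math
from collections import Counter
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def load_rows(
    lines: Iterable[str],
    text_col: str = "Text",      # assumed column name
    labels_col: str = "Labels",  # assumed column name
    sep: str = ";",              # assumed separator between labels
) -> List[Tuple[str, List[str]]]:
    """Parse CSV lines (an open file works) into (text, labels) pairs."""
    rows = []
    for row in csv.DictReader(lines):
        labels = [part.strip() for part in row[labels_col].split(sep) if part.strip()]
        rows.append((row[text_col], labels))
    return rows


def predict_labels(
    text: str,
    train_vectors: List[Vector],
    train_labels: List[List[str]],
    embed: Callable[[str], Vector],  # hypothetical: text in, vector out
    k: int = 5,
    min_votes: int = 2,
) -> List[str]:
    """Return labels carried by at least `min_votes` of the k nearest examples."""
    query_vec = embed(text)
    order = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine(query_vec, train_vectors[i]),
        reverse=True,
    )
    votes: Counter = Counter()
    for i in order[:k]:
        votes.update(set(train_labels[i]))  # each neighbour votes once per label
    return [label for label, count in votes.most_common() if count >= min_votes]
```

To use it, open the CSV with `newline=""`, pass the file object to `load_rows`, embed every training text once, and keep those vectors around. `k` and `min_votes` would need tuning on held-out rows, and labels that appear only once or twice in the training data are unlikely to ever be predicted this way.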