How can we solve extreme multilabel classification using OpenAI/LLMs?
I have more than 50k labels. How can we deal with that?
_j
2
There is no way to even provide that many words to an AI within a context length where it could comprehend them.
50,000 “labels” is close to the size of a dictionary covering most of a language, so the labels essentially become meaningless.
If you had tons of very specialized labels that could not be extracted by algorithmic word searches (say “18th-century French poetry”, “Recipes better with celery salt”, “Boris Yeltsin administration”, or what have you), the only way I see is to pass perhaps 1000 terms at a time, running 50 AI classifications or extractions over the same text. Tell the AI that no label is to be attached unless it is an exact semantic match for the content.
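The batching idea above can be sketched roughly as follows. This is only an illustration: `classify_in_batches`, `chunked`, and `keyword_stub` are hypothetical names, and `keyword_stub` is a stand-in for the real model call (it just does a substring match so the sketch runs without any API).

```python
from typing import Callable, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_batches(
    text: str,
    labels: List[str],
    classify_batch: Callable[[str, List[str]], List[str]],
    batch_size: int = 1000,
) -> List[str]:
    """Run one classification pass per label batch and merge the results."""
    matched: List[str] = []
    for batch in chunked(labels, batch_size):
        allowed = set(batch)
        # Keep only labels that really are in this batch, so a model that
        # invents a label outside the candidate list cannot pollute the output.
        for label in classify_batch(text, batch):
            if label in allowed and label not in matched:
                matched.append(label)
    return matched


def keyword_stub(text: str, candidates: List[str]) -> List[str]:
    """Hypothetical stand-in for the LLM call: verbatim substring match."""
    lowered = text.lower()
    return [c for c in candidates if c.lower() in lowered]


# Toy label set: 50,000 placeholder labels with two meaningful ones mixed in.
labels = [f"label {i}" for i in range(50_000)]
labels[12_345] = "celery salt"
labels[49_999] = "french poetry"

result = classify_in_batches(
    "A recipe that is better with celery salt.", labels, keyword_stub
)
```

With a real model, `classify_batch` would send the text plus the 1000 candidate labels in one prompt and parse the returned labels. That means 50 requests per document, so cost and latency scale with the number of batches.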
sps
3
@_j has a point with the large number of labels. What are you trying to classify into what?
lids
4
You may be able to use embeddings, e.g. ada-002 (text-embedding-ada-002), to predict the labels. Mileage will vary with the uniqueness of the corpus and how representative your training data is.
However, I agree with @_j that with 50,000 labels it would be meaningless. What’s the use case?
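One way to read "use embeddings to predict the labels" is a nearest-neighbour vote over already-labeled examples. The sketch below assumes the embeddings have been computed elsewhere (the toy 3-dimensional vectors stand in for real embedding vectors), and `knn_predict_labels`, `k`, and `min_votes` are illustrative names, not part of any library.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict_labels(
    query_vec: Sequence[float],
    train_vecs: List[Sequence[float]],
    train_labels: List[Set[str]],
    k: int = 3,
    min_votes: int = 2,
) -> List[str]:
    """Return labels carried by at least `min_votes` of the k nearest examples."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    votes: Counter = Counter()
    for i in ranked[:k]:
        votes.update(train_labels[i])
    return sorted(label for label, n in votes.items() if n >= min_votes)


# Toy data: pretend these are embeddings of labeled training texts.
train_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
train_labels = [{"poetry", "france"}, {"poetry"}, {"cooking"}, {"politics", "russia"}]

predicted = knn_predict_labels([0.95, 0.05, 0.0], train_vecs, train_labels, k=2)
```

Labels that appear in only a handful of training rows will rarely collect enough votes, which is one reason results depend on how representative the training data is.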
Do you have text descriptions associated with your labels? If so, embedding those descriptions might be the start of an approach.
Have GPT ‘tl;dr’ your input, then embed that summary and do a similarity search against the label embeddings.
Or some variant of that.
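A minimal sketch of the similarity step, assuming the summary and each label description have already been embedded. The vectors here are made-up 3-dimensional stand-ins, and `top_labels`, `top_k`, and `min_score` are hypothetical names chosen for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_labels(
    summary_vec: Sequence[float],
    label_vecs: Dict[str, Sequence[float]],
    top_k: int = 5,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Rank labels by similarity of their description embedding to the summary."""
    scored = [(cosine(summary_vec, vec), name) for name, vec in label_vecs.items()]
    kept = [(score, name) for score, name in scored if score >= min_score]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [(name, score) for score, name in kept[:top_k]]


# Pretend embeddings of the label descriptions (assumed, not real model output).
label_vecs = {
    "18th-century French poetry": [0.9, 0.1, 0.0],
    "Recipes better with celery salt": [0.0, 1.0, 0.1],
    "Boris Yeltsin administration": [0.1, 0.0, 0.9],
}

# Pretend embedding of the model-written tl;dr of the input text.
summary_vec = [0.1, 0.9, 0.2]

matches = top_labels(summary_vec, label_vecs, top_k=2, min_score=0.5)
```

The `min_score` cutoff is what turns a ranking into a multilabel decision. A sensible threshold has to be tuned on labeled examples, since similarity scores are not comparable across embedding models.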
sps
6
@bruce.dambrosio has a nice approach.
If you want to work on the full text, here’s a way:
1. Categorise the labels into high-level groups.
2. Classify the text into the matching high-level group.
3. Classify the text again to find the best label within the group chosen in step 2.
Note: To categorise the labels, use embeddings to cluster them.
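The steps above could look something like this, assuming label and text embeddings are already available. Everything here is illustrative: `cluster_labels` is a deliberately tiny k-means-style loop seeded with the first few labels (a real setup would more likely use a library clusterer), and the 2-dimensional vectors are toy stand-ins for embeddings.

```python
import math
from typing import Dict, List, Sequence

Vec = Sequence[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors: List[Vec]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_labels(label_vecs: Dict[str, Vec], n_groups: int, n_iter: int = 10) -> List[List[str]]:
    """Group labels by embedding; the first `n_groups` labels seed the centroids."""
    names = list(label_vecs)
    centroids = [list(label_vecs[n]) for n in names[:n_groups]]
    groups: List[List[str]] = []
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for name in names:
            best = max(range(len(centroids)),
                       key=lambda g: cosine(label_vecs[name], centroids[g]))
            groups[best].append(name)
        # Recompute each centroid; an empty group keeps its previous centroid.
        centroids = [mean([label_vecs[n] for n in g]) if g else c
                     for g, c in zip(groups, centroids)]
    return [g for g in groups if g]


def two_stage_classify(text_vec: Vec, label_vecs: Dict[str, Vec], groups: List[List[str]]) -> str:
    """Pick the closest group first, then the closest label inside that group."""
    centroids = [mean([label_vecs[n] for n in g]) for g in groups]
    best_group = max(range(len(groups)), key=lambda g: cosine(text_vec, centroids[g]))
    return max(groups[best_group], key=lambda n: cosine(text_vec, label_vecs[n]))


label_vecs = {
    "french poetry": [1.0, 0.1],
    "soup recipes": [0.1, 1.0],
    "english poetry": [0.9, 0.2],
    "salad recipes": [0.2, 0.9],
}
groups = cluster_labels(label_vecs, n_groups=2)
best = two_stage_classify([0.05, 1.0], label_vecs, groups)
```

This sketch returns a single best label. For the multilabel case, the second stage could instead keep every label in the chosen group above a similarity threshold, or hand the group's labels to an LLM as the candidate list.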
Ooo, I like that, that is nifty, added to toolbox of nice ideas.
I have a CSV file that contains two columns: the text, and the multiple labels associated with that text.
The total number of unique labels is more than 50k.
When I send new text, the model should predict labels for it from that set of 50k labels, in the same way the rows in the CSV file are labeled.
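For reference, a file like that could be loaded roughly as below. The column names `Text` and `Labels`, the `;` separator between labels, and the sample rows are all assumptions made for illustration, since the actual file layout is not shown.

```python
import csv
import io
from typing import IO, List, Set, Tuple


def load_examples(
    handle: IO[str],
    text_col: str = "Text",      # assumed column name
    labels_col: str = "Labels",  # assumed column name
    sep: str = ";",              # assumed separator between labels in one cell
) -> List[Tuple[str, Set[str]]]:
    """Read (text, label set) pairs from a two-column CSV."""
    examples = []
    for row in csv.DictReader(handle):
        labels = {part.strip() for part in row[labels_col].split(sep) if part.strip()}
        examples.append((row[text_col], labels))
    return examples


# Made-up sample standing in for the real file.
SAMPLE_CSV = (
    'Text,Labels\n'
    '"Odes from 1750s Paris","18th-century French poetry; France"\n'
    '"Soup with celery salt","Recipes better with celery salt"\n'
)

examples = load_examples(io.StringIO(SAMPLE_CSV))
unique_labels = set().union(*(labels for _, labels in examples))
```

With the real file, `open(path, newline="", encoding="utf-8")` would replace the `io.StringIO` wrapper. The resulting pairs are the training examples that an embedding-based approach would index.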