How do I handle a large number of classes for classification

I’m trying to do multilabel multiclass classification. What is the best approach, considering I have >30K classes, so I can’t fit them into the prompt to show the model what I want as a response? Is this a job for fine-tuning? Any advice would be great!

This has come up a number of times before.

My go-to suggestion is to define a classification taxonomy and do it in a multistage effort.

For instance, maybe you can structure your classes into 50 equally sized “genera,” then split each genus into 30 equally sized “species.”

So, first you classify by genus, then by species within that genus. You’ve reduced your space by a factor of 1,500, leaving you to classify among only the ~20 members of that species.
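A minimal sketch of that narrowing, using a toy taxonomy. The `classify_step` function here is a placeholder (a made-up word-overlap heuristic); in practice it would prompt the model with only the candidate names listed, so each call stays well within the context window:

```python
# Toy two-level taxonomy; real stages would have ~50, ~30, and ~20 options.
taxonomy = {
    "genus-A": {"species-A1": ["class-001", "class-002"],
                "species-A2": ["class-003"]},
    "genus-B": {"species-B1": ["class-004"]},
}

def classify_step(text, candidates):
    # Placeholder: a real implementation would send `candidates` to the
    # model and return its pick. Here we fake it by choosing the candidate
    # that shares the most words with the input text.
    def overlap(c):
        return len(set(text.lower().split())
                   & set(c.lower().replace("-", " ").split()))
    return max(candidates, key=overlap)

def classify(text):
    genus = classify_step(text, list(taxonomy))            # stage 1
    species = classify_step(text, list(taxonomy[genus]))   # stage 2
    return classify_step(text, taxonomy[genus][species])   # stage 3
```

The key point is that each stage only ever presents a small candidate list, regardless of how many classes exist overall.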

Alternately, you might look at embeddings. If you have 10 to 30 examples of each of your classes, then embedding them along with the object to be classified might yield some insight. You’d do some type of mean-reciprocal-rank scoring of all 30,000 classes against the new object, pick the highest-ranked classes, then try to classify among just those.
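A rough sketch of that shortlisting step. Everything here is illustrative: the toy character-count `embed()` stands in for a real embedding model, and each class prototype is just the mean of its example embeddings:

```python
import math

def embed(text):
    # Stand-in embedding: 26-dim letter-frequency vector, just so this runs.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def class_prototypes(examples_by_class):
    # Average the 10-30 example embeddings per class into one prototype.
    protos = {}
    for label, examples in examples_by_class.items():
        vecs = [embed(e) for e in examples]
        protos[label] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return protos

def shortlist(text, protos, k=5):
    q = embed(text)
    ranked = sorted(protos, key=lambda lbl: cosine(q, protos[lbl]),
                    reverse=True)
    return ranked[:k]  # hand only these k classes to the final classifier
```

Only the top-k shortlisted classes then need to fit in the prompt for the final classification call.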

Or even both.

30,000 classes is a lot of classes. It’s not a particularly easy thing to accomplish in an unsupervised manner, especially if the distinctions between some classes are nebulous and open to interpretation.


Allow me to be curious for a moment: what kind of data has 30k possible labels/classes?


One example:

Edit: well, 20,000 classes anyway:

61,404,966 image-level labels on 20,638 classes


Good example, I’ve bookmarked that one, pretty interesting dataset!

Besides the multistage filter classifier that @elmstedt mentions (a good approach if it fits), you could look at training your own model.

One approach: embed the item to get a vector, then feed that vector into your own neural network whose final layer has 30k “buckets” of values ranging continuously from 0 to 1. Values closer to 1 mean the input is in-class for the class represented by that slot.
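As a sketch of that multi-label readout: the scores below are invented, but in practice they’d be the trained network’s final-layer outputs, with a sigmoid squashing each one into (0, 1) and every slot above a threshold treated as in-class:

```python
import math

def sigmoid(x):
    # Squash a raw score into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(raw_scores, class_names, threshold=0.5):
    # One score per class; every slot whose sigmoid clears the
    # threshold is emitted, so multiple labels can fire at once.
    return [name for name, s in zip(class_names, raw_scores)
            if sigmoid(s) >= threshold]
```

Because each slot is thresholded independently, this naturally gives you multilabel output rather than a single winner.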

I know, it’s a lot of work. But it’s almost required.

You could also try the “cheap” version: just use labeled embeddings, and anything whose embedding is very close inherits the same set of labels.
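A minimal sketch of that cheap version, with made-up 3-dim vectors standing in for real embeddings of already-labeled items:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# (embedding, label set) pairs -- toy vectors for illustration only.
labeled = [
    ([1.0, 0.0, 0.2], {"panda", "bamboo"}),
    ([0.0, 1.0, 0.1], {"rock"}),
]

def labels_for(query_vec):
    # Copy the label set of the most similar labeled item.
    _, best_labels = max(labeled, key=lambda p: cosine(query_vec, p[0]))
    return best_labels
```

A real version would likely average over the top few neighbors rather than trusting a single nearest match.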

Not sure which one would work better for your exact situation. Guessing the trained neural network would, because it could discover relations that aren’t in your labeled data. But the labeled embeddings are a lot easier to try out.


I would personally go with a mixture of @elmstedt’s embedding approach and hierarchical agglomerative clustering. It might be slow, but it doesn’t matter as much if you only have to do it once.
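For illustration, here is a tiny single-linkage agglomerative clustering over 1-D toy points. A real run would cluster embeddings of the 30k classes (e.g. with `scipy.cluster.hierarchy`, which is far faster than this naive loop), and as noted it only needs to happen once, offline:

```python
def single_linkage(points, n_clusters):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]

    def dist(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find and merge the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Cutting the resulting hierarchy at a couple of levels would give exactly the kind of multistage taxonomy discussed earlier in the thread.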


My labels are actually already organized in a hierarchical taxonomy structure, so this multistage technique might be a good approach. One concern I would have, though, is a scenario like this…

I’m trying to classify “pandas like to eat bamboo”.
I have an “Animal” category containing “Panda,” so the text would be classified as Panda, which is only partially correct. But what if I also have a “Plant” category containing “Bamboo”? Would there be a way to multilabel it as both “Panda” and “Bamboo” if they are from two separate categories?
This type of thing happens a lot in my data, so would that steer me away from a technique like this?

I will definitely look into the embedding technique you mentioned. Some of the things you said are unfamiliar, so I’ll have to do some research, but that certainly sounds promising. Would it be sort of like RAG? Like have it go find the embeddings that are close so that I have a small enough number of tokens to fit into a GPT context window?

I’m not opposed to training my own model, although it would require time and effort I may or may not be able to spend. I’ll keep that one in my back pocket in case other things don’t work well enough.

The “cheap” version you mention sounds similar to what I’ve been doing. It is not terrible, but I want it to be better. Also, the current way I’m doing it only gives me single labels, so part of my project now is to get multilabel going.

I’m trying to understand this. Would the clustering be to get my classes organized? If so, they’re already structured (Sorry, I left that out of my original message). Or would the clustering be used in a different way?

I’m noticing that some of the suggestions here don’t seem to be related to GPT prompts. Is that just not the right tool for this job? I’m not attached to the idea, but it does seem like the GPT models are very good at interpreting the words in my input text.


I’m no expert, but going the GPT/LLM route is certainly one of the most expensive ways to do this; given the complexity of your use case, though, it might make sense.

Perhaps someone much smarter than me can explain the merits of each strategy.