Say there is a hierarchy of categories: red, orange, blue, violet.
You first train on warm vs. cold colors, i.e. red/orange vs. blue/violet.
Then once you have that result, you break it down further: train one model to distinguish red vs. orange, and another model to distinguish blue vs. violet.
So you have 3 models total: one to decide warm vs. cold, and 2 follower models to break things down further depending on which branch the first model took.
Each of these models makes a binary choice and requires less data per choice … but there is no net win, since you still have to build the other models that make the additional choices.
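Here is a minimal sketch of that routing. The three classifiers are hypothetical stand-ins (in practice each would be a fine-tune or an embedding-based model), stubbed with trivial keyword rules just to show how the tree composes:

```python
def warm_vs_cold(text: str) -> str:
    # Model 1: picks the branch of the tree.
    return "warm" if any(w in text for w in ("red", "orange")) else "cold"

def red_vs_orange(text: str) -> str:
    # Follower model for the warm branch.
    return "red" if "red" in text else "orange"

def blue_vs_violet(text: str) -> str:
    # Follower model for the cold branch.
    return "blue" if "blue" in text else "violet"

def classify(text: str) -> str:
    # Route through the first model, then hand off to a follower.
    if warm_vs_cold(text) == "warm":
        return red_vs_orange(text)
    return blue_vs_violet(text)

print(classify("a reddish sunset"))  # -> "red"
```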
In the end, I feel it’s all the same, and you need lots of data to distinguish hundreds of categories.
The binary-tree approach is very organized, but you need more models. You can use a fine-tune or embeddings for each decision.
Fine-tunes tend to be black boxes, while embeddings sit somewhere between opaque and transparent.
In the end, it's lots of work either way because of the large number of categories involved.
Or you can shift work to cost by bootstrapping with a multi-shot model to create your labeling/training data.
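A minimal sketch of that bootstrapping step, assuming a hypothetical `call_llm` wrapper around whatever completion API you use:

```python
# Multi-shot prompt: a few worked examples, then the item to label.
FEW_SHOT = """Label each color description as red, orange, blue, or violet.

Description: the shade of a ripe tomato
Label: red

Description: a clear summer sky
Label: blue

Description: {text}
Label:"""

def call_llm(prompt: str) -> str:
    # Hypothetical: swap in your provider's completion client here.
    raise NotImplementedError

def bootstrap_labels(texts: list[str]) -> list[tuple[str, str]]:
    # Each (text, label) pair becomes cheap training data for the
    # downstream classifiers, at the cost of one LLM call per item.
    return [(t, call_llm(FEW_SHOT.format(text=t)).strip()) for t in texts]
```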
I have been experimenting more with embeddings as classifiers myself.
Classifiers built on embeddings can have any topology too: flat, like clusters (which is what I do), or binary trees (I haven't tried that one with embeddings).
So based on the method, I tend to gravitate towards a topology: with a fine-tune I would use a binary tree, and with an embedding approach I would use a flat correlation approach.
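For the flat approach, here is a minimal sketch assuming you already have embeddings from some model and a few labeled examples per category; classification is nearest centroid by cosine similarity:

```python
import numpy as np

def centroids(examples: dict[str, list[np.ndarray]]) -> dict[str, np.ndarray]:
    # One cluster center per category: the mean of its example embeddings.
    return {label: np.mean(vecs, axis=0) for label, vecs in examples.items()}

def classify(vec: np.ndarray, centers: dict[str, np.ndarray]) -> str:
    # Flat correlation: compare against every center, take the best match.
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centers, key=lambda label: cos(vec, centers[label]))
```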