Best solution for multilabel classification


I came here to seek confirmation that I’m moving in the right direction, and also to discover other possible solutions to my problem.
Core data
I have a classification document containing labels on two structural levels. Each element has a name and a description. There are 180 elements on the first level and 1500 on the second, so 1680 combined.
My mission
I aim to build a microservice that assigns multiple labels from my classification document to an input text (max length 250 words).
My solution
So far I’ve been looking into the OpenAI Classification, Embeddings, and Semantic search documentation, and I’ve come up with a possible solution using the /classification endpoint.

  1. I use my classification document to fine-tune a model, to teach it how input text is generally labeled. I can put ~1680 examples into the training file, even though that many examples might not be necessary.
    1.1. Alternatively/additionally, I have another document containing ~100 input texts that are already multilabeled (~75 labels per input text, which is also the ideal outcome of the microservice).
  2. The OpenAI classification documentation says that I can use a maximum of 200 labels for classification. I will use the labels from the 1st level of the classification document (the one with 180 labels).
  3. I want to set a threshold that decides whether each label applies to the input text or not. Let’s say I get ~15 labels on the 1st level. I will then make 15 separate API calls, one per parent label, each time passing in the corresponding sub-labels of that parent label.
  4. Again, a threshold is set that decides whether each sub-label applies to the input text or not.
  5. When the procedure is finished, I should have multiple labels on my input text: the 1st level labels and the corresponding 2nd level sub-labels for each parent label.
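
To make the control flow of steps 2 to 5 concrete, here is a minimal sketch of the two-stage thresholding. It does not call any real API: `scorer` is a hypothetical callable standing in for whatever classification call is used, `keyword_scorer` is a toy replacement so the example runs offline, and the threshold values and label names are made up for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical interface: given a text and candidate labels,
# return a score in [0, 1] for each label.
Scorer = Callable[[str, List[str]], Dict[str, float]]


def label_hierarchically(
    text: str,
    hierarchy: Dict[str, List[str]],
    scorer: Scorer,
    parent_threshold: float = 0.5,
    child_threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Return {parent_label: [sub_labels]} for every parent above threshold."""
    # Stage 1: one call that scores all 1st level labels.
    parent_scores = scorer(text, list(hierarchy))
    selected: Dict[str, List[str]] = {}
    for parent, score in parent_scores.items():
        if score < parent_threshold:
            continue
        # Stage 2: one extra call per selected parent, scoring only its sub-labels.
        child_scores = scorer(text, hierarchy[parent])
        selected[parent] = [
            child for child, s in child_scores.items() if s >= child_threshold
        ]
    return selected


def keyword_scorer(text: str, labels: List[str]) -> Dict[str, float]:
    """Toy stand-in for a real classifier: 1.0 if the label name occurs in the text."""
    lowered = text.lower()
    return {label: (1.0 if label.lower() in lowered else 0.0) for label in labels}


# Made-up two-level hierarchy, far smaller than the real 180 / 1500 labels.
hierarchy = {
    "finance": ["loans", "insurance"],
    "health": ["nutrition", "fitness"],
}
result = label_hierarchically(
    "Finance news: new rules for loans.", hierarchy, keyword_scorer
)
print(result)
```

The number of calls per request is 1 plus the number of 1st level labels that pass the threshold, which is where the ~15 to ~20 calls per request come from.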

Currently my concerns are:

  • will it be too costly to let users use this service on a daily basis (~100 requests a day)? Each request is going to make ~20 API calls, which also use a fine-tuned model (as I understand it, using a fine-tuned model costs more).
  • will this solution even work as I intend it to?
  • would there be a more cost-friendly solution?
  • which models should I use? I can’t find much documentation about classification with GPT-3.5 or turbo, only with ada models.
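
For the cost question, a rough back-of-the-envelope estimate can be sketched like this. Only the ~100 requests a day and ~20 calls per request come from the figures above; `tokens_per_call` and `price_per_1k_tokens` are placeholder assumptions, not real prices, and need to be replaced with values from the current pricing page.

```python
def estimate_daily_cost(
    requests_per_day: int,
    calls_per_request: int,
    tokens_per_call: int,
    price_per_1k_tokens: float,
) -> float:
    """Daily cost = total tokens per day / 1000 * price per 1K tokens."""
    total_calls = requests_per_day * calls_per_request
    total_tokens = total_calls * tokens_per_call
    return total_tokens / 1000 * price_per_1k_tokens


requests_per_day = 100   # from the description above
calls_per_request = 20   # from the description above
calls_per_day = requests_per_day * calls_per_request

# Placeholder assumptions, NOT real figures: look these up before relying on them.
tokens_per_call = 1000
price_per_1k_tokens = 0.01

daily_cost = estimate_daily_cost(
    requests_per_day, calls_per_request, tokens_per_call, price_per_1k_tokens
)
print(f"{calls_per_day} calls/day, estimated cost per day: {daily_cost:.2f}")
```

The point is mainly that the cost grows linearly with the number of calls per request, so reducing the per-parent calls has the biggest effect.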

Thank you in advance for your input!

With that many labels you will be hard pressed to get good performance from a fine-tune, because of the vast amount of clean training data required to reach a high enough signal-to-noise ratio (SNR).

Plus, running it will be pricier because it is a fine-tune.

Instead, what I would do is take your training data, embed the text, and create a vector database that stores each embedding vector together with its label.

Then, when a new text comes in, embed it and correlate it with your previously embedded data. Take the entry with the highest correlation, pull its label, and use that to label the new incoming text.

This scales, is relatively cheap, and you can dynamically add more data on the fly.
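
A minimal sketch of the lookup step, assuming the embeddings have already been computed elsewhere. The 3-dimensional vectors and label names below are made up so the example runs offline (real embedding vectors have far more dimensions), the "vector database" is just a Python list, and cosine similarity is used as the correlation measure.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_labels(
    query: Sequence[float],
    index: List[Tuple[Sequence[float], List[str]]],
) -> Tuple[List[str], float]:
    """Return the labels of the stored vector most similar to the query."""
    best_labels: List[str] = []
    best_score = float("-inf")
    for vector, labels in index:
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score


# Toy "vector database": (embedding, labels) pairs with made-up values.
index = [
    ([1.0, 0.0, 0.0], ["finance", "banking"]),
    ([0.0, 1.0, 0.0], ["health"]),
    ([0.0, 0.0, 1.0], ["sports"]),
]

# Pretend this is the embedding of a new incoming text.
query = [0.9, 0.1, 0.0]
labels, score = nearest_labels(query, index)
print(labels, round(score, 3))
```

Adding new labeled data is just appending another (embedding, labels) pair to the index, with no retraining. For a multilabel outcome, one variation would be to take the top few matches instead of only the best one and merge their labels.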