I have a similar multi-label classification use case, with around 10K labeled samples and 30 labels. I have some questions, if you can answer them:
Will it be feasible to fine-tune any OpenAI model on this much data?
How should the target feature be prepared to fine-tune the model?
Did you find a solution for converting the data to JSONL for multi-label? If yes, please share.
Also, can you please outline the steps to perform multi-label classification using GPT-3?
This thread is quite old, and the GPT-3 base models are no longer available.
Instead, for fine-tuning, you would have your choice of davinci-002 (likely gpt-3.5-turbo-sized rather than the 175B-parameter GPT-3) or a chat model such as gpt-4o-mini, which can be instructed and will follow directions.
I offer a technique that is more reliable for selecting from a small label set: use structured outputs with a boolean schema that answers true/false for all 30 labels. You then get JSON fields such as "matches_keyword_technology": false.
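A minimal sketch of what that looks like, assuming the openai Python SDK's Pydantic-based structured outputs helper; the keyword field names are placeholders for your full set of 30:

```python
# Sketch: one boolean field per allowed label (30 in total; three shown here).
from openai import OpenAI
from pydantic import BaseModel

class LabelFlags(BaseModel):
    matches_keyword_technology: bool
    matches_keyword_finance: bool
    matches_keyword_health: bool
    # ... remaining labels

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "For each keyword field, answer whether it applies to the user's text."},
        {"role": "user",
         "content": "Apple unveiled a new AI chip for the iPhone today."},
    ],
    response_format=LabelFlags,  # structured outputs: the schema is enforced
)
flags = completion.choices[0].message.parsed
print([name for name, hit in flags.model_dump().items() if hit])
# e.g. ['matches_keyword_technology']
```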
This approach lets you start with the inexpensive -mini model and just prompting plus your data, to see whether in-context instruction alone fulfills the task with quality. Every allowed keyword is evaluated, instead of the model trying to produce the correct ones from thin air. It would take thousands of inferences before a fine-tune that emits shorter output breaks even on cost.
(being able to turn off the injected response format on a fine-tuned model would be a nice bonus)
I think the idea is great. Depending on the nature and diversity of the labels, though, couldn't it lead to overclassification? It could be useful to overlay it with some prioritization or ranking, provided that gpt-4o-mini can still handle that in one go.
I have one fine-tuned gpt-3.5-turbo model for multi-label classification, which needs to select from about 100 different labels. In my particular case, I have found it useful to have the prioritisation baked in, so I only get the top 3 labels. In cases where I left it open-ended, it would often return up to 10 different labels.
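As an illustration of baking the ranking in (not my fine-tuned setup itself, just a prompt-plus-schema approximation with hypothetical label names), the schema can force exactly three labels in priority order:

```python
# Sketch: force exactly three labels, in priority order, via the schema.
from openai import OpenAI
from pydantic import BaseModel

class RankedLabels(BaseModel):
    top_1: str  # most relevant label
    top_2: str
    top_3: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # or your fine-tuned model, if it supports structured outputs
    messages=[
        {"role": "system",
         "content": "Pick the three most relevant labels, most relevant first, "
                    "from: billing, refunds, shipping, returns, account, privacy."},
        {"role": "user",
         "content": "My package never arrived and I want my money back."},
    ],
    response_format=RankedLabels,
)
print(completion.choices[0].message.parsed)
# e.g. top_1='shipping' top_2='refunds' top_3='returns'
```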
For an ideal employment of the technique, and for the economics to work, I consider the input to be larger than the schema and understandable on its own. It could work for tagging articles with blog categories, for example.
Fine-tuning would be for cases where the criteria just can't be understood or explained within the context, and reinforcement learning would be the path to producing the desired output.
Yes, overclassification seems to be a general issue that we often face in multi-label classification. But is there a mechanism to solve it when, as in my case, the number of output labels is not fixed? They can vary between 2 and 6. So how can I select the exact number of predicted labels?
What is the condition that determines the expected number of labels, if it can be from 2 to 6?
Instead of directly prompting for binary true/false values for each class, you could ask for a score for each class and then pick the top-N classes (those with the highest scores). I tried something like this here, but did not get it working too well for our application.
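A sketch of that scoring approach, again assuming the structured outputs parse helper and placeholder label names:

```python
# Sketch: ask for a 0-1 relevance score per label, then keep the top N.
from openai import OpenAI
from pydantic import BaseModel, Field

class LabelScores(BaseModel):
    technology: float = Field(description="relevance from 0 to 1")
    finance: float = Field(description="relevance from 0 to 1")
    health: float = Field(description="relevance from 0 to 1")

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Score each label's relevance to the user's text, from 0 to 1."},
        {"role": "user",
         "content": "Central banks weigh AI tools for fraud detection."},
    ],
    response_format=LabelScores,
)
scores = completion.choices[0].message.parsed.model_dump()
top_n = sorted(scores, key=scores.get, reverse=True)[:2]  # e.g. N = 2
print(top_n)  # e.g. ['finance', 'technology']
```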
Another way would be to prompt for binary classification, and get scores for the classes from token probabilities.
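A sketch of that logprobs route, assuming the Chat Completions API's logprobs option: one call per label, reading the "yes" probability from the first answer token.

```python
# Sketch: one binary question per label; derive a score from token logprobs.
import math
from openai import OpenAI

client = OpenAI()

def label_probability(text: str, label: str) -> float:
    """Return P('yes') for a single label, taken from the first token's logprobs."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Does the label '{label}' apply to the user's text? Answer yes or no."},
            {"role": "user", "content": text},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = completion.choices[0].logprobs.content[0].top_logprobs
    # Sum probability mass over 'yes'-like candidate tokens.
    return sum(math.exp(t.logprob) for t in top
               if t.token.strip().lower() == "yes")

p = label_probability("Apple unveiled a new AI chip today.", "technology")
print(f"technology: {p:.2f}")
```

With per-label probabilities in hand, you could either threshold them or sort and take the top N, which maps naturally onto the "2 to 6 labels" situation above.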