GPT-3 Fine-Tuning for Multilabel Classification

Hi all,
Has anyone used GPT-3 fine-tuning for multilabel text classification?

I would highly appreciate seeing an example of how to structure the completion, and learning from best practices.

For clarification:
My ask is not about the common (multiclass) classification use case.
Multilabel means that each instance has one or more labels associated with it.

Thank you


@akopp Find any good resources here? I’m poking around myself. If I find anything, I’ll post it here as well.

In my case I was able to achieve good results using entity extraction similar to this example: OpenAI API

Not sure if this is generally a good approach for multilabel but in my scenario it worked quite well.

I don’t see why you wouldn’t be able to use their finetune classification example for multiclass… You would just feed multiple classes into the training data, right? Give it a go!

I am also interested in how we can fine-tune the model for multi-label classification. If anyone has figured it out let me know please or if there is any notebook on how to do that?


@saraahmadi321 @akopp @aaron5 have you guys found any solutions? I want to do something similar where I want to multi-label customer feedback. Thanks in advance!


You can create a fine-tune by creating a jsonl file with example prompts and completions. So a file with hundreds or thousands of lines like this:

{"prompt": "Your company is awesome.\n\n###\n\n", "completion": " 1"}
{"prompt": "Your company is bad.\n\n###\n\n", "completion": " 0"}

Note the space before the completion output. Also, the '\n\n###\n\n' is what I used as the separator (stop sequence), which is the same one recommended by OpenAI.
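
To avoid quoting/escaping mistakes in the JSONL file, you can generate it with `json.dumps`. A minimal sketch (the file name and example texts/labels are illustrative):

```python
import json

SEPARATOR = "\n\n###\n\n"  # separator appended to every prompt

def make_example(text: str, label: str) -> dict:
    # Completion starts with a leading space, per OpenAI's fine-tune guidance.
    return {"prompt": text + SEPARATOR, "completion": " " + label}

examples = [
    make_example("Your company is awesome.", "1"),
    make_example("Your company is bad.", "0"),
]

# One JSON object per line = valid JSONL for the fine-tune upload.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```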

Then to create the fine-tune, follow along with the docs:

Here would then be the parameters you would send when you call your fine-tuned model. The important ones being the temperature of 0, and in my case, since I am just outputting single character labels, I set max_tokens to 1.

{"temperature": 0, "max_tokens": 1, "top_p": 1, "logprobs": 2, "frequency_penalty": 0, "presence_penalty": 0}

Be sure to append the separator '\n\n###\n\n' to the input prior to sending it off to the trained fine-tune.
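
Putting the parameters and the appended separator together, a hedged sketch using the legacy `openai` Python SDK's Completion endpoint (the fine-tune model name is a placeholder, not a real model):

```python
SEPARATOR = "\n\n###\n\n"

def build_request(text: str) -> dict:
    # Assemble the call exactly as described: separator appended to the
    # input, temperature 0, single-token output.
    return {
        "model": "davinci:ft-your-org-2023-01-01",  # placeholder fine-tune name
        "prompt": text + SEPARATOR,
        "temperature": 0,
        "max_tokens": 1,
        "top_p": 1,
        "logprobs": 2,
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }

# import openai
# resp = openai.Completion.create(**build_request("Your company is awesome."))
# label = resp["choices"][0]["text"].strip()
```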

Note: You can only fine-tune base models right now. So ‘davinci’ and not ‘text-davinci-003’.

For fine-tunes with a lot of training data, you can get away with using ada or babbage for classification. But you can train various versions and see which ones work better for you. The higher end models such as curie or davinci would be used if you have limited training data.

That’s basically the high level advice on creating a fine-tune.

Good luck!


This is great advice. I was playing around with fine-tuning for this as well. If you limit max tokens to 1, though, isn’t that still a multi-class model instead of a multi-label one? Even with fine-tuning I’ve been struggling to get the model to spit out the correct set of labels using the TRAINING data… sigh.

Which model did you fine-tune?

How many prompt/completion pairs did you provide it for training?

Did you include the stop token?

The classification would map to single character output so, ‘0’, ‘1’, … ‘a’, ‘b’, … ‘z’

Then you map this to what the meaning is.

To map one thing to two or more different categories, you need to run two or more models against the same input. But the same idea.
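
For the single-character approach, one way to realize the mapping in code is a lookup table where each output character stands for a combination of labels (the label names here are made up for illustration):

```python
# Each single-character completion encodes one label combination.
CODE_TO_LABELS = {
    "0": [],
    "1": ["billing"],
    "2": ["shipping"],
    "3": ["billing", "shipping"],
}

def decode(completion: str) -> list:
    # The model emits the character with a leading space, so strip first.
    return CODE_TO_LABELS[completion.strip()]
```

Note the combinations grow exponentially with the label count, which is why running one classifier per label (as described above) scales better.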

Oooooooooh I think I see what you mean. I can map a single character to multiple labels. That’s clever! The data-prep CLI tool recommended I use ada so that’s what I went with. I did NOT include the stop token (which I will now do). I fed it ~200 prompts with 5 labels/completions. Even with the temp at 0 I was getting some wild outputs.


With 200 training points, you might want to bump up to babbage or curie (and davinci as a last resort). The higher models might be able to “learn quicker” since they have more parameters.

Otherwise beef up your training data if you want to use ada and the stop token doesn’t fix it.

Hi Curt,
I am doing multiclass classification with 230 labels and 120k training samples. What model would you recommend?

I would start with Babbage and go up if the results aren’t good. But that is a ton of labels. Maybe break it up into several coarse classifier groups up front and refine further with additional classifiers.


@dml1002313 Did you solve your issue? I’m looking to perform multi-label classification of text into 7 labels; none, one, or many labels can apply to the text. I already have my dataset in the format ‘this is a sample’,0,1,0,1,1,0,0,1 and am trying to work out how best to structure my dataset for fine-tuning GPT-3 so that I can get a response such as:
prompt: this is an example test
completion: 1,1,1,0,1,0,0,0
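
One hedged sketch of round-tripping that comma-separated binary completion format, assuming a fixed label order (these label names are placeholders; swap in your own):

```python
# Fixed label order must match the bit positions in the completion.
LABELS = ["urgent", "billing", "shipping", "refund", "praise", "complaint", "question"]

def encode(active_labels) -> str:
    # e.g. {"billing", "refund"} -> "0,1,0,1,0,0,0"
    return ",".join("1" if name in active_labels else "0" for name in LABELS)

def parse(completion: str) -> list:
    # Map the model's "0,1,0,..." completion back to label names.
    bits = completion.strip().split(",")
    return [name for name, bit in zip(LABELS, bits) if bit == "1"]
```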


@aaron5 @jerome1
Hello, have you guys found any solutions? I want to do something similar where I want to do multi-label classification. Thanks in advance!

@maysaa.khalil Yes. But I also forget the exact context I was thinking about when I responded up top lol. Can you do a super quick summary of your use case?

Hi @akansel, thanks a lot for your reply. My use case is finding an idea or solution for how to fine-tune a multi-label classification problem with OpenAI models. Thanks again :slight_smile:

@maysaa.khalil Yes. We’ve almost entirely migrated away from using fine-tuned models for classification in favor of 3.5 turbo. It’s cheaper and faster, and when you combine it with validation I think it’s more effective. On that same note, historically we’ve used fine-tuned models to perform the validation check, but even in that case we’re evaluating using 3.5 turbo instead.

To speak specifically to multi-label classification without knowing the specifics of your use case might be tricky, but overall I would recommend putting formatting expectations into your prompts in a very specific way. I’ve found that clearly defining the expected prefixes, suffixes and delimiters not only allows you to generate cleaner variables for your code, but the models also perform multi-label classifications more effectively when you designate the format. Sort of like you’re defining multiple “containers” of generated content in the same request. This makes it more obvious to you AND the model when a particular container is empty.
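
One illustrative way to implement the “containers” idea: instruct the model to answer in fixed, prefixed sections, then split those sections back out in code. The section names and format below are assumptions for the example, not a fixed convention:

```python
import re

# Formatting instructions you would embed in the prompt so every
# response uses the same named "containers".
FORMAT_INSTRUCTIONS = (
    "Respond in exactly this format:\n"
    "LABELS: <comma-separated labels, or NONE>\n"
    "SENTIMENT: <positive|negative|neutral>"
)

def parse_sections(text: str, keys) -> dict:
    # Pull each "KEY: value" line out of the model's response;
    # an empty string signals that container came back empty.
    out = {}
    for key in keys:
        match = re.search(rf"^{key}:\s*(.*)$", text, re.MULTILINE)
        out[key] = match.group(1).strip() if match else ""
    return out
```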

For suggestions on good formatting instructions, I’d recommend GPT-4 via chat. Lay out your use case and the classification decisions you’d like the model to return, and ask it to suggest formatting guidelines that can be used to clearly differentiate the various aspects of the request.

If you’re using 3.5 turbo instead of fine-tuning a model, are you not providing hundreds of user validated examples to refine the results you’re getting?

@devbydylan The advantage of this approach is that you don’t have to. I’ll give you an example:

Let’s say you’re evaluating user reviews and you want to assign each review one value called “happy_or_mad” to indicate sentiment and one value called “purchased_item” that tries to identify what the user actually bought. Rather than feed it hundreds of validated reviews, you engineer a prompt the best that you can, then run it hundreds of times on real data or generated data. After reviewing the results, you make adjustments to the prompt to account for cases where it got one or more of the values wrong. If you can’t get it to the desired degree of accuracy after doing this a couple times, you create a separate prompt for the review model to look out specifically for those tricky reviews that your first model can’t seem to nail. Because this second prompt is focused exclusively on exceptional use cases, it shouldn’t have any problems identifying and correcting them. The advantage of GPT 3.5 is it super fast and cheap so you can afford to run each review twice. Makes sense? If you want to share your use case or something similar I’d be happy to help you think through it.