GPT3 Finetuning for Multilabel Classification

Hi all,
has anyone used GPT3 finetuning for multilabel text classification?

I would really appreciate an example of how to structure the completion, and any best practices you can share.

For clarification:
My ask is not about the common (multiclass) classification use case.
Multilabel means that each instance has 1 or more labels associated.

Thank you


@akopp Find any good resources here? I’m poking around myself. If I find anything, I’ll post it here as well.

In my case I was able to achieve good results using entity extraction similar to this example: OpenAI API

Not sure if this is generally a good approach for multilabel but in my scenario it worked quite well.

I don’t see why you wouldn’t be able to use their finetune classification example for multiclass… You would just feed multiple classes into the training data, right? Give it a go!

I am also interested in how we can fine-tune the model for multi-label classification. If anyone has figured it out, please let me know, or point me to a notebook showing how to do it.


@saraahmadi321 @akopp @aaron5 have you guys found any solutions? I want to do something similar where i want to multi-label customers feedback. Thanks in advance!


You can create a fine-tune by creating a jsonl file with example prompts and completions. So a file with hundreds or thousands of lines like this:

{"prompt": "Your company is awesome.\n\n###\n\n", "completion": " 1"}
{"prompt": "Your company is bad.\n\n###\n\n", "completion": " 0"}

Note the space before the completion output. Also, the '\n\n###\n\n' is what I used as a stop word, which is the same one recommended by OpenAI.
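A minimal sketch of building such a training file in Python (assuming the legacy prompt/completion JSONL format; the example texts and filename are made up):

```python
import json

# Separator appended to every prompt; completions start with a space.
SEPARATOR = "\n\n###\n\n"

def make_training_line(text, label):
    """Format one prompt/completion pair as a JSONL line."""
    return json.dumps({
        "prompt": text + SEPARATOR,
        "completion": " " + str(label),  # note the leading space
    })

lines = [
    make_training_line("Your company is awesome.", 1),
    make_training_line("Your company is bad.", 0),
]

with open("train.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```

Using `json.dumps` instead of hand-writing the braces avoids the smart-quote / escaping mistakes that make the CLI reject the file.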

Then to create the fine-tune, follow along with the docs:

Here would then be the parameters you would send when you call your fine-tuned model. The important ones being the temperature of 0, and in my case, since I am just outputting single character labels, I set max_tokens to 1.

{"temperature": 0, "max_tokens": 1, "top_p": 1, "logprobs": 2, "frequency_penalty": 0, "presence_penalty": 0}

Be sure to append the stop word of '\n\n###\n\n' to the input prior to sending it off to the trained fine-tune.
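A small sketch of that inference-side step (the client call shape in the comment is illustrative only, not a specific SDK signature):

```python
# Must match the separator used when building the training JSONL.
SEPARATOR = "\n\n###\n\n"

def build_prompt(user_text):
    """Append the training-time separator before calling the fine-tune."""
    return user_text + SEPARATOR

# Hypothetical call shape, matching the parameters above:
# response = client.completions.create(
#     model=FINE_TUNE_ID,             # your fine-tuned model id
#     prompt=build_prompt("Your company is awesome."),
#     temperature=0,
#     max_tokens=1,
# )
```

If you forget the separator at inference time, the model sees a prompt distribution it was never trained on and the outputs degrade badly.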

Note: You can only fine-tune base models right now. So 'davinci' and not 'text-davinci-003'.

For fine-tunes with a lot of training data, you can get away with using ada or babbage for classification. But you can train various versions and see which ones work better for you. The higher end models such as curie or davinci would be used if you have limited training data.

That’s basically the high level advice on creating a fine-tune.

Good luck!


This is great advice. I was playing around with fine-tuning as well for this. If you limit the max tokens to 1 though isn’t that still a multi-class instead of a multi-label model? Even with fine-tuning I’ve been struggling to get the model to spit out the correct set of labels using the TRAINING data… sigh.

Which model did you fine-tune?

How many prompt/completion pairs did you provide it for training?

Did you include the stop token?

The classification would map to single character output, so '0', '1', … 'a', 'b', … 'z'

Then you map this to what the meaning is.

To map one thing to two or more different categories, you need to run two or more models against the same input. But the same idea.
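One way to read this (my interpretation, not spelled out in the thread): enumerate the label combinations and assign each combination its own character, so a single-token completion can still encode multiple labels. A hypothetical sketch with made-up label names:

```python
import string

LABELS = ["billing", "shipping", "quality"]  # hypothetical label names

# Assign one character to each of the 2**n possible label subsets.
CHARS = string.digits + string.ascii_lowercase
combo_to_char = {}
char_to_combo = {}
for i in range(2 ** len(LABELS)):
    combo = frozenset(l for bit, l in enumerate(LABELS) if i >> bit & 1)
    combo_to_char[combo] = CHARS[i]
    char_to_combo[CHARS[i]] = combo

def encode(labels):
    """Label set -> single training-completion character."""
    return combo_to_char[frozenset(labels)]

def decode(char):
    """Model's single-character output -> set of labels."""
    return char_to_combo[char]
```

This only scales to a handful of labels (2**n combinations must fit in your character set); past that, the multiple-models or multi-token-completion approaches discussed below are more practical.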

Oooooooooh I think I see what you mean. I can map a single character to multiple labels. That’s clever! The data-prep CLI tool recommended I use ada so that’s what I went with. I did NOT include the stop token (which I will now do). I fed it ~200 prompts with 5 labels/completions. Even with the temp at 0 I was getting some wild outputs.


With 200 training points, you might want to bump up to babbage or curie (and davinci as a last resort). The higher models might be able to “learn quicker” since they have more parameters.

Otherwise beef up your training data if you want to use ada and the stop token doesn’t fix it.

Hi Curt,
I am doing multiclass classification with 230 labels and 120k training samples. What model would you recommend?

I would start with Babbage and go up if the results aren't good. But that is a ton of labels. Maybe break it up into several coarse classifier groups up front and further refine with additional classifiers.


@dml1002313 Did you solve your issue? I'm looking to perform multi-label classification of text into 7 labels. None, one, or many labels can apply to the text. I already have my dataset in the format 'this is a sample',0,1,0,1,1,0,0,1 and am trying to work out how best to structure it for fine-tuning GPT3 so that I can get a response such as:
prompt: this is an example test
completion: 1,1,1,0,1,0,0,0
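One way to structure that (a sketch under the legacy prompt/completion format; for a multi-token completion like this you would raise max_tokens to cover the whole label vector and set a stop sequence such as a newline):

```python
import json

SEPARATOR = "\n\n###\n\n"  # prompt separator, as in the earlier posts
STOP = "\n"                # end-of-completion marker for multi-token output

def row_to_training_line(text, bits):
    """Turn ('this is a sample', [0,1,0,1,1,0,0,1]) into a JSONL line."""
    completion = " " + ",".join(str(b) for b in bits) + STOP
    return json.dumps({"prompt": text + SEPARATOR, "completion": completion})

print(row_to_training_line("this is a sample", [0, 1, 0, 1, 1, 0, 0, 1]))
```

At inference you would then pass `stop="\n"` and split the returned completion on commas to recover the per-label bits.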
