Using the new fine-tunes endpoint for binary classification

Hello, I have run into problems trying to migrate from the old, soon-to-be-deprecated fine-tunes endpoint to the new one.

I have previously managed to fine-tune the base ada model for the task of binary classification. I followed the official cookbook (I tried to include a link, but the forum doesn’t let me) and it worked perfectly. My training/validation data has the following shape:

{"prompt": "some text \n\n###\n\n", "completion": " 0"}
{"prompt": "some other text \n\n###\n\n", "completion": " 1"}

To fine-tune the model, I have run the following command using the CLI:

openai api fine_tunes.create -t "<file-train>" -v "<file-validation>" -m ada --compute_classification_metrics --classification_n_classes 2 --classification_positive_class " 1" --n_epochs 1 

To get completions, I run the following python code:

openai.Completion.create(model="my-model-id", prompt=prompt, max_tokens=1, temperature=0, logprobs=2)

This would give me logprobs values for classes " 0" and " 1" for each prompt, which is exactly what I expected to get.
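For reference, this is roughly how I read the predicted class out of those logprobs; a minimal sketch, assuming the per-token top logprobs come back as a {token: logprob} dict (the example values are made up):

```python
import math

def predict_class(top_logprobs):
    """Pick the more likely class from a {token: logprob} dict,
    e.g. {" 0": -0.01, " 1": -4.61} for the two trained classes."""
    token, logprob = max(top_logprobs.items(), key=lambda kv: kv[1])
    # strip the leading space from the completion token, return probability too
    return token.strip(), math.exp(logprob)

label, prob = predict_class({" 0": -0.01, " 1": -4.61})  # -> ("0", ~0.99)
```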

Now, since the original endpoint and base models are getting deprecated soon, I have tried to follow the same procedure with the new base model babbage-002 on the new fine-tuning endpoint, as recommended. However, the API specification for the new endpoint appears to be missing the classification-related parameters, that is:

  • classification_n_classes
  • classification_positive_class
  • compute_classification_metrics

I have tried omitting these parameters and training the model using the following Python code:

from openai import OpenAI

client = OpenAI(api_key="my-API-key")
client.fine_tuning.jobs.create(
    model="babbage-002",
    training_file="file-train-id",
    validation_file="file-validation-id",
    hyperparameters={"n_epochs": 1},
)

Then, I get the completions using the following code:

client.completions.create(model="my-model-id", prompt=prompt, max_tokens=1, temperature=0, logprobs=2)

However, the result contains logprobs for nonsensical classes, such as

    {' ': 0.0, ',': -24.314941}
    {' ': 0.0, ' a': -21.548826}
    {' ': 0.0, ' new': -24.072266}

If I try to set logprobs=4, I get something similar to

    {' ': 0.0, ' new': -24.072752, ',': -24.249998, ' no': -24.59448}

What especially baffles me is that my prompts are not even in English, while the completions are either a blank, a comma, or an English word.

Hence the question is, how can I use the new fine-tuning endpoint in the way I could use the old endpoint for the classification task? More specifically, is it possible to pass the classification-related parameters specifying the number of classes etc. to the new endpoint?

Thanks in advance for any insight!

I have the same problem. Is there any solution for fine-tuning a classification problem?

It’s related to the new tokenizer, so to fix this, try the direct route.

Change from this:

{"prompt": "some text \n\n###\n\n", "completion": " 0"}
{"prompt": "some other text \n\n###\n\n", "completion": " 1"}

To this:

{"prompt": "some text", "completion": "0"}
{"prompt": "some other text", "completion": "1"}
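If you have a lot of training data, a quick script can rewrite the JSONL mechanically; a minimal sketch (the file names are placeholders):

```python
import json

def strip_old_format(line):
    """Drop the ' \n\n###\n\n' separator from the prompt and the
    leading space from the completion, per the advice above."""
    rec = json.loads(line)
    rec["prompt"] = rec["prompt"].replace(" \n\n###\n\n", "")
    rec["completion"] = rec["completion"].strip()
    return json.dumps(rec)

# usage: rewrite the training file line by line
# with open("train.jsonl") as src, open("train_new.jsonl", "w") as dst:
#     for line in src:
#         dst.write(strip_old_format(line) + "\n")
```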

And I just used the online GUI for fine-tuning instead of the command line.


I’ve followed your advice @curt.kennedy, but I am not getting either of the two training classes back as completions when I call the API (instead, I get other completions like “I”).

My classes are not “0” or “1”, but two separate strings (“Spam”, “Good”).

The model training data was like this:

{"prompt":"i definitely love the music. also the starting was perfect. its good for dancing classes. the flow and rythem of the song is pretty. but it would be bettr if it had flactuated more and also it was boring cause they repeated the lyrics for several times","completion":"Spam"}

I call the model with the following parameters (as I did before on the old API which worked perfectly):

  prompt : "This is a lovely piece.",
  model: "myfinetunedmodel",
  max_tokens: 1,
  temperature: 0,
  logprobs: 2

Also, I used the fine-tuning GUI, but note that I didn’t specify anything about this being a classification problem. The old API inferred that it was a classification problem and told you so, setting up and preparing the data accordingly. Perhaps I am missing a step like this?

The problem is that “Spam” is 2 tokens, and you are only allowing 1 token in the output.

Try changing max_tokens to 2, and see if that helps.

The other problem is that if you plan on using your logprobs, you need to combine the probabilities of “Spam” since it is 2 tokens (“Sp” + “am”). So it’s a bit of a mess.
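In code, combining them is just adding the per-token logprobs, since probabilities multiply (the example values below are made up):

```python
import math

def joint_prob(token_logprobs):
    """Probability of a multi-token completion such as "Spam" -> ["Sp", "am"].
    Log-probabilities of consecutive tokens add, so the joint
    probability is exp of their sum."""
    return math.exp(sum(token_logprobs))

# hypothetical per-token logprobs for "Sp" and "am"
p_spam = joint_prob([-0.2, -0.1])  # = exp(-0.3)
```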

This is why I shoot for 1 token output, and then map the tokens to the desired label. I believe the integers 0 through 999 are all 1 token entities, and could be used as indices into your classification labels. And the labels are free to change/update as they aren’t burned into the training data for the fine-tune.

You can also solve this by lower casing to “spam”, which is 1 token, but this involves a re-train.

Thanks @curt.kennedy. I retrained the model like you said with “0” or “1” as the completion. It’s definitely better and usually predicts “0” or “1” for my prompts, but with shorter prompts it sometimes predicts “,” or “.” - definitely not as reliable as the old API, where you could be sure you would only get your class labels back.

Any idea how to overcome this?

Probably the best way is to train the model on these shorter prompts. You can always add to a fine-tune by selecting the fine-tuned model’s name when you come up with new JSONL data that needs training.

If these shorter prompts are junk, then you should just filter them out, or try normalizing your input before the fine-tune.

Also, worst case, look at log probs and see if “0” or “1” is in there, and pick the one with the highest log prob.
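A minimal sketch of that fallback, assuming the top logprobs come back as a {token: logprob} dict:

```python
def pick_label(top_logprobs, labels=("0", "1")):
    """Restrict the model's top logprobs to the known class labels
    and return the most likely one; None if neither label appears."""
    candidates = {t.strip(): lp for t, lp in top_logprobs.items()
                  if t.strip() in labels}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# e.g. the model put "," first, but "0" is still in the list:
# pick_label({",": -0.5, "0": -1.2, "1": -3.0}) -> "0"
```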

Excellent advice, thank you. Could you clarify what you mean by “try normalizing your input before the fine-tune”?

Sure … basically get rid of extraneous spacing, odd capitalizations, etc.

So use a combination of regex and lower-/upper-casing to clean things up.
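For example, something along these lines (the exact rules will depend on your data):

```python
import re

def normalize(text):
    """Collapse runs of whitespace into single spaces, trim the ends,
    and lower-case everything."""
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

# normalize("  This   IS\n  Messy ") -> "this is messy"
```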

Also, one thing I forgot: if you have a small amount of training data, say fewer than 500 examples, I remember trying a higher model like DaVinci, since it seemed to learn more from the limited data.

Then after you accumulate more training data, you can go back down to Babbage with this larger training set.

So normalize, go with DaVinci, or beef up the training data and then go to Babbage.

Thanks for the advice. But this kind of untidy/weird input is exactly what I want as a signal for predicting spam in this instance (well, it’s not quite spam we’re predicting, but low-quality language).

Training data is just over 3,000 samples as proof of concept, but we have access to vastly more than this if need be.

Besides fine-tunes for classification, I recently posted about embedding classifiers over here.

Embeddings are bombproof and will accept any non-empty text, garbage or not. Plus data can be added and subtracted on the fly.

Something else to look into, and maybe a better fit for the Spam/Ham detector. You could always run both and do some weighted average between them too.
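As a rough sketch of the embedding-classifier idea, with toy 2-D vectors standing in for real embeddings (in practice the vectors would come from an embeddings API call):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(vec, labeled):
    """Return the label of the most similar labeled example.
    `labeled` is a list of (embedding, label) pairs; adding or
    removing examples is just editing this list, no retraining."""
    best = max(labeled, key=lambda pair: cosine(vec, pair[0]))
    return best[1]

examples = [([1.0, 0.0], "Spam"), ([0.0, 1.0], "Good")]
# classify([0.9, 0.1], examples) -> "Spam"
```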