Fine-tuning a classifier and handling unclassifiable data

Hello, I’m fine-tuning the ada model for multiclass classification of users’ messages. I have about 10 categories and I use the category number as the completion, for example:
1 - message about rent
2 - message about work

{"prompt": "**APT For Rent**\uD83D\uDD25\nDowntown views 2 /T1\nLocation: Downtown \nType: 1 BR\nSize: 69 SQM\nFully furnished \nRent: AED 150k->", "completion":  " 1"}
{"prompt": "Hi,  I am in Dubai.  looking for a job in a Japanese restaurant.  work experience of about 10 years->", "completion":  " 2"}

How can I teach the model to handle data that doesn’t fit any category?
For example:
“In honor of the holiday, there is a discount in our supermarket from June 15 to June 28 for a number of products. Free shipping over AED 150 anywhere in Dubai” — this is a message about a discount, yet the model still assigns it completion 1 or 2, as it does with other off-topic messages.
These messages can be on completely different topics, and I can’t anticipate all of them to include in the training set.
Out of the many messages, I only want the ones that best fit my categories. I have about 100 unique observations per class in my training set. The model works fine on data that fits one of the categories, but what should I do with unclassifiable data, which makes up more than 80% of my messages?

Have you tried passing the messages to, say, GPT-3.5 and asking it to classify them? Perhaps it can suggest new classes to add to your list.

A few approaches.

First the “easy one”, but may not work entirely …

Train on a “None” category: create a "category": " 0" and map all your uncategorized training data to it.

The only problem is that there isn’t anything specific or keyword-wise for the AI to learn from. So there are doubts this will work, but it’s worth a try.
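For example, the off-topic lines in the JSONL training file would look like this (the supermarket message is taken from the question above, shortened; " 0" is the hypothetical “none” label):

```json
{"prompt": "In honor of the holiday, there is a discount in our supermarket from June 15 to June 28 for a number of products.->", "completion": " 0"}
{"prompt": "Congratulations to everyone on the upcoming weekend!->", "completion": " 0"}
```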

Another option, possibly more reliable, is to use the log probabilities reported for your already-trained categories: if the score is high for the respective category, use that category; otherwise, consider the message uncategorized.

For example (in Python):

import math

# get the log probability (base e) of the first completion token
# (response is the object returned by the completions API call)
log_prob = response["choices"][0]["logprobs"]["token_logprobs"][0]
print(f"LogProb: {log_prob}")

# convert the log probability to a probability in [0, 1]
prob = math.exp(log_prob)

print(f"Prob: {prob}")  # if this is > 0.8 (or something high), then use the category

You can of course do both, i.e., train on ' 0' and use logprobs.

For a binary classifier, you would have parameters like so:

{"temperature": 0, "max_tokens": 1, "top_p": 1, "logprobs": 2, "frequency_penalty": 0, "presence_penalty": 0}

You need to set "logprobs" in the API call to get them back. The maximum number of values it will return is the top 5. Looking at these gives you the ranking of “confusion” across the top 5 tokens, or across up to your top 5 categories if you set "logprobs": 5 in your API call.

Here is what I got back when I sent "logprobs": 2:

"logprobs": {"tokens": [" 1"], "token_logprobs": [-0.07227323], "top_logprobs": [{" 1": -0.07227323, " 0": -2.6639233}], "text_offset": [42]}

You can also go with "logprobs": 2, like me, even in the multi-class case, but you will only see the top 2 predicted tokens. The slight risk is that the top 2 tokens may not map to your intended tokens, so going higher catches those scenarios.
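Putting that together, here is a minimal sketch of the thresholding logic (assuming a response shaped like the example above; the 0.8 cutoff is arbitrary and worth tuning on a validation set):

```python
import math

def pick_category(response, threshold=0.8):
    """Return the predicted category token if the model is confident,
    otherwise None (treat the message as unclassifiable)."""
    logprobs = response["choices"][0]["logprobs"]
    token = logprobs["tokens"][0]
    prob = math.exp(logprobs["token_logprobs"][0])  # convert log prob to [0, 1]
    return token if prob >= threshold else None

# example response, copied from the logprobs output shown above
response = {"choices": [{"logprobs": {
    "tokens": [" 1"],
    "token_logprobs": [-0.07227323],
    "top_logprobs": [{" 1": -0.07227323, " 0": -2.6639233}],
}}]}

print(pick_category(response))  # " 1" (exp(-0.0723) ≈ 0.93, above the cutoff)
```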

Lastly, if none of this works, you could train a binary classifier and use it up front, where the training is ' 0' for things that are off topic and ' 1' for things that are on topic. Use logprobs here too, to check how certain the prediction is before proceeding.

Then only send the on-topic things to your current classifier. This filtering up-front is usually better, since it simplifies your logic downstream. But you could do this and all of the above to create a robust classification system.
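The two-stage flow can be sketched like this (the is_on_topic and classify callables are placeholders for your two fine-tuned models, each returning a predicted token and its probability as above):

```python
def two_stage_classify(message, is_on_topic, classify, threshold=0.8):
    """Filter with a binary on-topic model first, then run the
    multiclass classifier only on messages that pass the filter."""
    token, prob = is_on_topic(message)   # binary model: " 1" = on topic
    if token != " 1" or prob < threshold:
        return None                      # off topic, or not confident enough
    token, prob = classify(message)      # existing multiclass model
    return token if prob >= threshold else None

# stub models for illustration only (replace with real API calls)
on_topic = lambda m: (" 1", 0.95) if "rent" in m or "job" in m else (" 0", 0.9)
classes = lambda m: (" 1", 0.9) if "rent" in m else (" 2", 0.9)

print(two_stage_classify("APT for rent downtown", on_topic, classes))            # " 1"
print(two_stage_classify("Supermarket discount until June 28", on_topic, classes))  # None
```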


Thank you so much! The option with log probabilities works great. I’ll change my training data a little, and I’ll get the result I wanted.
