How good is Davinci with Text Classification?

Hello Community,

I’ve got about 3 years of general NLP experience: I’ve worked extensively with entity/intent-based NLU, worked a bit on ML NER and text classification models, and of late I’ve read a lot about transformer technology and played in the OpenAI Playground for the past week. On the other hand, I’ve been programming for 20 years and feel very comfortable with “traditional” rule-based AI.

In my bits and pieces of experience, ML models are better at extracting entities than at classifying large bodies of text. Sure, sentiment analysis with 4 or 5 sentiment labels is reasonably “easy”, but in the past, when it came to classifying text against about 30 labels, it became increasingly hard to train models that reach, say, 98% accuracy. Yet for the same problem, extracting entities and logically deducing a text category based on which entities were identified seems to be a simpler and more effective solution.

Suppose you had 30 categories to perform text classification on, and the text to be classified is a paragraph of 1 to 6 sentences, yet you only have a handful of examples per category (15 at most, with many categories having only one or two examples). On top of that, the categories and examples aren’t exactly “crisp”, since categories overlap (like Rain, Flood, TropicalStorm, Hurricane). These often produce texts that even human beings can spend hours arguing over: should “Excessive rain and wind damage due to Hurricane Harvey resulted in river banks to overflow…” fall under “Flood”, or is the proximate cause the “Rain”, or is the main event the “Hurricane”, etc.? Given the amount of training data and the complexity of the task, would OpenAI’s Davinci engine reach 98% accuracy?

My thought is simply to identify causes, extracting multiple causes per text, and to use a rule-based system that defines which category applies for a given combination of causes. For example, Hurricane + Flood = “Flood”, yet Hurricane + Rain = “TropicalStorm”, and Hurricane without (Flood or Rain) = “Hurricane”.
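A rule table like that can stay trivially small. Here’s a minimal Python sketch of the idea; the cause names, the “Other” fallback, and the rule ordering are illustrative assumptions, not part of any real system:

```python
# Minimal sketch: map a set of extracted causes to one category label.
# Rule order encodes precedence (more specific combinations first).

def classify(causes: set) -> str:
    """Return a category label for a set of extracted causes."""
    if "Hurricane" in causes and "Flood" in causes:
        return "Flood"
    if "Hurricane" in causes and "Rain" in causes:
        return "TropicalStorm"
    if "Hurricane" in causes:
        return "Hurricane"
    if "Flood" in causes:
        return "Flood"
    if "Rain" in causes:
        return "Rain"
    return "Other"  # assumed fallback label
```

Changing a business rule then really is a one-line edit rather than a retraining run.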

Playing with Curie and Davinci, I’m starting to think logic isn’t their main strength and is in some circumstances better left to “normal” code. Does anybody share this thinking with me?

How wrong am I? I’d love to hear other people’s thoughts.


Davinci is very good at language tasks. I would be surprised if breaking the task down into entity extraction followed by rules would give you better performance, but I’d be happy to be proven wrong. You’re right - logic is not davinci’s strength.

If there’s no way to get inter-human agreement as high as 98%, then no system can achieve >98% accuracy, as long as labels are based on how humans label them. Maybe a tagging system is more appropriate? You could potentially ask davinci-instruct to tag the text with a comma-separated list, choosing out of the given 30 tags.
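That tagging idea can be framed as a prompt plus a parser that keeps only tags from the allowed set. A small sketch; the prompt wording and the tag list are illustrative assumptions, not a tested prompt:

```python
# Sketch: build a tagging prompt for an instruct model, then parse its
# comma-separated reply, discarding anything outside the allowed tag set.

ALLOWED_TAGS = ["Rain", "Flood", "TropicalStorm", "Hurricane"]  # subset for illustration

def build_prompt(text: str) -> str:
    """Assemble an instruction-style prompt asking for comma-separated tags."""
    return (
        f"Tag the following text with any of these labels: {', '.join(ALLOWED_TAGS)}.\n"
        "Reply with a comma-separated list of labels only.\n\n"
        f"Text: {text}\nLabels:"
    )

def parse_tags(completion: str) -> list:
    """Split the model's reply on commas and keep only known tags."""
    tags = [t.strip() for t in completion.split(",")]
    return [t for t in tags if t in ALLOWED_TAGS]
```

The filter step matters because, as noted later in this thread, the model can invent labels that aren’t in the list.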



Thank you @boris, I feel like trying your suggestion.

For this particular problem, human accuracy can be as bad as 60%: users tend to select “Other”, or scan the list of options and pick the first “best” label that sort of matches, which is why the correct examples are so few. The goal isn’t to mimic humans, it’s to solve the problem of humans classifying incorrectly. Yet when you define business rules as previously described, rules that clearly state what each label is for, a machine can definitely follow them at a very high level of accuracy. I’ve built such systems; it just takes long, which is why I’m here.

An extension to this problem is where, during QA or once the system goes live, the client or testers start saying “oh, you know that rule about flood + rain… can we change it to rather state xyz?” If the rules are in a simple decision table, the change takes a minute. If the “rules” are inferred by ML, does the change involve changing the training file and retraining? Or would it be as simple as changing your instructions for Davinci-instruct? It makes me want to go try it and see, but I lean towards thinking that having explicitly defined, changeable business rules is more practical.

I suppose my other reason for holding on to entities is that I deal with multiple clients with the same problem, but the labels change from one client to the next: some have 12, some have 50, all fundamentally requiring the same functionality. If I had one entity extraction model that solves all these problems, with a different set of labels and rules for each client, would that be better (in terms of development time and accuracy) than building a new model for each client?

And my apologies if I seem argumentative; I’m actually wrestling with these questions myself. I don’t want to waste time experimenting too much, and I don’t want to make bad decisions and have to rework solutions down the line.

For the particular project I describe, we also use the entities in multiple places in the system: knowing what questions to ask, what information to retrieve, calculations, etc. I have lots and lots of data for entity extraction in the form of regression tests, too. The point of switching to GPT-3 is to add flexibility and improve the system’s NLU components; I’m just scared I’m holding on to old-fashioned ideas because my thinking hasn’t quite adapted yet.


The following example is super simple in comparison. The texts I was referring to are typically much longer, but these examples were fabricated to show the ambiguity.

Tag text with the following set of labels:

Description: Hurricane Harvey caused significant damage as rivers overflowed and flooded the construction site.

Description: Heavy rains caused flooding

Description: Weight of rain caused building to collapse

The problem you mention, of several different clients wanting to label essentially the same problem in slightly different categories, is common. Did you try the search endpoint? The search endpoint can give you a similarity score against each of the categories, and you can describe each category in natural language.

@boris, brilliant! I scanned over the section on the search endpoint but thought it was a type of question-answering feature, so I didn’t give it enough attention. I’ll definitely try it. Thank you!

The 3rd document was meant to disqualify hurricane damage (“not caused by hurricane”), but instead the query scored highest on that 3rd document.

-H "Content-Type: application/json" \
-H "Authorization: Bearer xx" \
-d '{
  "documents": [
    "damage due to hail",
    "damage caused by Hurricane",
    "Damage caused by hail but not caused by hurricane damage",
    "Damage caused by theives but not caused by rioters or vandals",
    "Damage caused by rioters",
    "Damage caused by water but not caused by sewage backup"],
  "query": "Hurricane and hail damage"
}'

{
  "object": "list",
  "data": [
    { "object": "search_result", "document": 0, "score": 295.784 },
    { "object": "search_result", "document": 1, "score": 232.972 },
    { "object": "search_result", "document": 2, "score": 402.171 },
    { "object": "search_result", "document": 3, "score": 77.785 },
    { "object": "search_result", "document": 4, "score": 72.773 },
    { "object": "search_result", "document": 5, "score": 195.766 }
  ],
  "model": "davinci:2020-05-03"
}
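Picking the top match from a response like that is just an argmax over the scores. A small sketch, with the response above pasted in as a Python dict for illustration:

```python
# Sketch: select the highest-scoring document from a search response.
response = {
    "object": "list",
    "data": [
        {"object": "search_result", "document": 0, "score": 295.784},
        {"object": "search_result", "document": 1, "score": 232.972},
        {"object": "search_result", "document": 2, "score": 402.171},
        {"object": "search_result", "document": 3, "score": 77.785},
        {"object": "search_result", "document": 4, "score": 72.773},
        {"object": "search_result", "document": 5, "score": 195.766},
    ],
}

best = max(response["data"], key=lambda r: r["score"])
# Document index 2 ("Damage caused by hail but not caused by hurricane
# damage") wins, which is exactly the unwanted behaviour described above:
# the negation in the document text did not lower its score.
```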


I’ve tried it for classification before and it seems overly creative. For context, I was using it to summarize interviews on social media use and it would create classifications like “cuteness” and “algorithms” that a human wouldn’t. It’s not wrong, but it’s not ideal behavior for classification.

However, below Davinci the models seem to be quite inaccurate with things like interview transcripts. In that situation someone was willing to pay, though, and the value was far higher than the cost, so it didn’t matter.