How to improve a fine-tuned classifier?

I have about 3000 rows of labeled metadata pertaining to news articles, and I am trying to decide whether each article is relevant to my application or not. The columns of metadata are: language, link, article title, publish date, a list of “factors” important to me, a list of triggers important to me, and the article source. The labels are ‘relevant’ or ‘noise’.

The factors and triggers include many categories; if I expanded them all out, it would be about 4000 extra columns. To make this dataset work w/ GPT-3, I smushed all of these columns into one ‘prompt’ column, prefixing each field with its column title (in case GPT-3 can do something w/ that info). It looks like this:

LANG:es LINK:https://www.<redacted>.com TITLE:RD, Costa Rica <redacted> RISKFACTORS:['<redacted1>', '<redacted2>', ..., '<redactedn>'] TRIGGERS:<redacted1>,<<redacted2>,...,<redactedn> SOURCE:<redacted>

Notice inconsistent use of [], etc.
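
One way to remove that inconsistency is to build the prompt from the columns with a single serializer. This is a minimal sketch, assuming each record is available as a dict; the field names mirror the prefixes in the example above, while `build_prompt`, the space separator, the sorting of list values, and the sample values are all hypothetical choices:

```python
# Hypothetical serializer: every list-valued column is rendered the same way
# (comma-separated, no brackets or quotes), so the model sees one format.
FIELDS = ["LANG", "LINK", "TITLE", "RISKFACTORS", "TRIGGERS", "SOURCE"]

def build_prompt(record, fields=FIELDS):
    parts = []
    for name in fields:
        value = record.get(name)
        if isinstance(value, (list, tuple, set)):
            # Sort so the same set of categories always yields the same text.
            value = ",".join(sorted(str(v).strip() for v in value))
        value = "" if value is None else str(value).strip()
        if value:  # skip empty columns instead of emitting a bare "NAME:"
            parts.append(f"{name}:{value}")
    return " ".join(parts)

# Made-up sample values, for illustration only.
example = {
    "LANG": "es",
    "LINK": "https://example.com/a",
    "TITLE": "Example title",
    "RISKFACTORS": ["flood", "drought"],
    "TRIGGERS": ["storm"],
    "SOURCE": "example-source",
}
print(build_prompt(example))
```

Passing a shorter `fields` list is also a cheap way to test whether suppressing a column changes accuracy.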

I know 3000 records is much more than what’s required, but I fed them all into an ADA model and got back an accuracy of 85%. I would like that to be a bit more, but I’m actually not sure about the labelling quality per se, so I’m also not looking for 100%.

What are the next steps to try to improve this model? Would going up to Curie or Babbage improve the result? Should I suppress some of the columns, or perform some transformations, or is it a “take it or leave it”? It’s a bit weird giving up control like this :smiley:

(also as a secondary question… I guess it is impossible to get any kind of feature importance or to understand what tokens were considered to decide the label?)

Thanks for reading!

  1. GIGO applies. Clean up your data. Delete bad records.
  2. Your sample count is good; don’t worry about adding more.
  3. Try a larger model.
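
One cheap cleanup pass that needs no extra taggers is to look for records whose text is the same but whose labels disagree. A sketch, assuming the data is available as (prompt, label) pairs; `find_label_conflicts` and the case/whitespace normalisation are hypothetical choices, not an established recipe:

```python
from collections import defaultdict

def find_label_conflicts(rows):
    """rows: iterable of (prompt, label) pairs.

    Returns a dict mapping each normalised prompt that was seen with more
    than one label to the sorted list of labels it received.
    """
    labels_by_prompt = defaultdict(set)
    for prompt, label in rows:
        key = " ".join(prompt.lower().split())  # normalise case and whitespace
        labels_by_prompt[key].add(label)
    return {p: sorted(ls) for p, ls in labels_by_prompt.items() if len(ls) > 1}

# Made-up rows: the first two differ only in case and trailing space.
rows = [
    ("TITLE:A", "relevant"),
    ("title:a ", "noise"),
    ("TITLE:B", "noise"),
]
print(find_label_conflicts(rows))
```

Records flagged this way are candidates for review or deletion; the rest of the data is left untouched.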

I don’t really understand what you’re trying to do?

@daveshapautomator Thanks. Regarding the cleaning of the bad data… it’s not so easy… tagging being subjective and having several taggers, we don’t have a quick way to identify the tagging correctness… Only way I know how is assigning multiple people per article, which isn’t feasible ($$). Happy to hear any suggestions on that front.

I’ll try a bigger model as per your suggestion.


It’s a binary classifier: “relevant”/“noise”

If this is the case then I have a few thoughts:

  • It’s possible that your classes are not clearly defined. This is usually the case if there are wide subjective discrepancies.
  • It’s possible that you do not have the correct classes. You may need more categories.
  • You can use multiple steps of GPT-3 to create accurate labels and then to verify said labels.

Please report back on whether using a larger model made a difference in accuracy!

@daveshapautomator yeah, I mean the labels are “relevant” vs “not-relevant-in-one-of-1000-ways”, so necessarily there will be some granularity.

I wasn’t aware that I could use GPT-3 to SUGGEST labels, is there a way to do that? Or is this a situation of “my team relabels using 2/3 different schema and we test each one to see which performs better”?

@jhsmith12345 I did attempt a Babbage, almost no increase to any of the metrics… accuracy went from 85% to 85.7%
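
A rough way to judge whether 85% versus 85.7% is a real difference is the margin of error of an accuracy estimate. The sketch below assumes a held-out set of 600 examples (20% of the 3000 rows); that size is an assumption for illustration, not a number reported in this thread, and the normal approximation is only a rule of thumb:

```python
import math

def accuracy_margin(accuracy, n_examples, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on
    n_examples held-out records (normal approximation to the binomial)."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_examples)

# ASSUMPTION: 600 validation examples; substitute the real count.
margin = accuracy_margin(0.85, 600)
print(f"85% +/- {margin:.1%}")
```

Under that assumption the margin is close to three percentage points, so a 0.7-point gain would be within noise.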

Just ask it. Use DAVINCI to synthesize higher quality data. I have quite a few videos with different techniques to synthesize data.
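
A sketch of what “just ask it” could look like. No real API call is made here: `complete` is a hypothetical stand-in for whatever function sends a prompt to the model and returns its text, and the instruction wording, the `criteria` string, and the answer parsing are all assumptions to adapt:

```python
LABELS = ("relevant", "noise")

def make_label_prompt(record_text, criteria):
    """Build a zero-shot instruction asking the model for one of LABELS."""
    return (
        "Decide whether the news article described below is relevant.\n"
        f"Criteria: {criteria}\n"
        f"Article metadata: {record_text}\n"
        f"Answer with exactly one word, {LABELS[0]} or {LABELS[1]}.\n"
        "Answer:"
    )

def propose_label(record_text, criteria, complete):
    """complete: hypothetical callable, prompt string -> model output string."""
    answer = complete(make_label_prompt(record_text, criteria)).strip().lower()
    for label in LABELS:
        if answer.startswith(label):
            return label
    return None  # unparseable answer: send to a human instead of guessing

# Usage with a fake model, so the sketch runs without network access.
fake_model = lambda prompt: " Relevant\n"
print(propose_label("TITLE:Example title", "mentions flooding", fake_model))
```

Returning `None` for anything that is not a clean label keeps ambiguous model output out of the synthetic dataset.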

ooooh it’s you! lol… I used your recursive summarizer at work and blew everyone’s mind! Thank you, and keep doing what you do! I’ll check your videos to identify a label-proposer. And I suppose that after that step is done, we would then correlate the synthetic GPT-3 labels w/ my human ones?
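
For the comparison step, raw agreement plus Cohen's kappa (agreement corrected for chance) is one common choice. A minimal sketch, assuming the human and synthetic labels are two equal-length lists in the same record order; the function name and the sample labels are hypothetical:

```python
from collections import Counter

def agreement_and_kappa(human, synthetic):
    """Return (observed agreement, Cohen's kappa) for two label lists."""
    if len(human) != len(synthetic) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == s for h, s in zip(human, synthetic)) / n
    h_counts, s_counts = Counter(human), Counter(synthetic)
    # Agreement expected by chance, from each labeller's label frequencies.
    expected = sum(h_counts[label] * s_counts[label] for label in h_counts) / (n * n)
    # Kappa is undefined when expected == 1; this sketch returns 1.0 there.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return observed, kappa

# Made-up labels, for illustration only.
human = ["relevant", "relevant", "noise", "noise"]
synthetic = ["relevant", "noise", "noise", "noise"]
observed, kappa = agreement_and_kappa(human, synthetic)
print(observed, kappa)
```

Records where the two sources disagree are also a natural shortlist for a second human look.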


We’ve been working on tools at Humanloop to make fine-tuning and improving GPT-3 easier. Would love to chat and see if we can help. Could you please drop me an email at raza at humanloop dot com?



You’ll have to experiment with it. If you have any kind of feedback mechanism you can just continually improve your dataset.
