How to improve a fine-tuned classifier?

I have about 3000 rows of labeled metadata pertaining to news articles, and I am trying to decide whether each article is relevant to my application or not. The metadata columns are: language, link, article title, publish date, a list of “risk factors” important to me, a list of “triggers” important to me, and the article source. The labels are ‘relevant’ or ‘noise’.

The factors and triggers cover many categories; if I expanded them all out it would add about 4000 extra columns. To make this dataset work w/ GPT-3, I smushed all of these columns into one ‘prompt’ column, prefixing each value with its column title (in case GPT-3 can do something w/ that info). It looks like this:

LANG:es LINK:https://www.<redacted>.com TITLE:RD, Costa Rica <redacted> RISKFACTORS:['<redacted1>', '<redacted2>', ..., '<redactedn>'] TRIGGERS:<redacted1>,<redacted2>,...,<redactedn> SOURCE:<redacted>

Notice inconsistent use of [], etc.
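
Roughly, the prompt column and the fine-tuning JSONL get built like the sketch below (not my exact code; the column names are placeholders, and the separator / leading-space conventions come from the fine-tuning docs):

```python
import json
import pandas as pd

# Placeholder file and column names; assume the list-valued columns are already Python lists.
df = pd.read_json("articles.json")

def to_prompt(row):
    # One convention everywhere: comma-separated values, no brackets or quotes.
    factors = ",".join(row["risk_factors"])
    triggers = ",".join(row["triggers"])
    return (
        f"LANG:{row['language']} LINK:{row['link']} TITLE:{row['title']} "
        f"RISKFACTORS:{factors} TRIGGERS:{triggers} SOURCE:{row['source']}"
        "\n\n###\n\n"  # fixed separator so the model can tell where the prompt ends
    )

with open("train.jsonl", "w") as f:
    for _, row in df.iterrows():
        record = {
            "prompt": to_prompt(row),
            # short completion with a leading space, per the old fine-tuning guidance
            "completion": " relevant" if row["label"] == "relevant" else " noise",
        }
        f.write(json.dumps(record) + "\n")
```

The main point is just to pick one delimiter convention and use it everywhere.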

I know 3000 records is much more than what’s required, but I fed them all into an Ada model and got back an accuracy of 85%. I would like that to be a bit higher, but I’m not entirely sure about the labelling quality itself, so I’m also not expecting 100%.

What are the next steps to try to improve this model? Would going up to Babbage or Curie improve the result? Should I drop some of the columns, or perform some transformations, or is it a “take it or leave it” situation? It’s a bit weird giving up control like this :smiley:

(Also, as a secondary question… I guess it is impossible to get any kind of feature importance, or to understand which tokens were considered when deciding the label?)

Thanks for reading!

  1. GIGO applies. Clean up your data and delete bad records (one way to flag suspect labels is sketched below).
  2. 3000 samples is plenty; don’t worry about adding more.
  3. Try a larger model.
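
For point 1, one cheap way to surface candidate bad records without double-tagging everything is to flag rows where a confident out-of-fold prediction from a simple baseline disagrees with the human label. A rough sketch (file and column names are placeholders):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

df = pd.read_csv("articles.csv")          # placeholder: has "prompt" and "label" columns
X, y = df["prompt"], (df["label"] == "relevant").astype(int)

baseline = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))

# Out-of-fold probabilities, so each row is scored by a model that never saw it.
proba = cross_val_predict(baseline, X, y, cv=5, method="predict_proba")[:, 1]

# Flag rows where the baseline is confident and disagrees with the human label.
df["suspect"] = ((proba > 0.9) & (y == 0)) | ((proba < 0.1) & (y == 1))
print(df[df["suspect"]][["prompt", "label"]].head(20))
```

Only the flagged rows then need a second human look, which is much cheaper than double-tagging the whole set.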

I don’t really understand what you’re trying to do?

@daveshapautomator Thanks. Regarding cleaning out the bad data… it’s not so easy. The tagging is subjective and was done by several taggers, so we don’t have a quick way to check label correctness. The only way I know is assigning multiple people per article, which isn’t feasible ($$). Happy to hear any suggestions on that front.

I’ll try a bigger model as per your suggestion.
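
For anyone following along, the change is basically just the base model name in the fine-tune call. A sketch assuming the legacy (pre-v1) openai Python package that was current for the Ada/Babbage/Curie models; the file name and API key are placeholders:

```python
import openai  # legacy pre-v1 SDK (openai<1.0)

openai.api_key = "sk-..."  # placeholder

# Upload the JSONL training file prepared earlier.
train_file = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# Same data, larger base model; classification metrics were an option on the old endpoint.
job = openai.FineTune.create(
    training_file=train_file["id"],
    model="babbage",                       # or "curie" for the next size up
    compute_classification_metrics=True,
    classification_positive_class=" relevant",
)
print(job["id"])
```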

@jhsmith12345

It’s a binary classifier: “relevant”/“noise”

If this is the case then I have a few thoughts:

  • It’s possible that your classes are not clearly defined. This is usually the case if there are wide subjective discrepancies.
  • It’s possible that you do not have the correct classes. You may need more categories.
  • You can use multiple steps of GPT-3 to create accurate labels and then to verify said labels (a minimal sketch follows below).
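
A minimal sketch of that propose-then-verify idea, assuming the legacy Completions endpoint and an instruct-style Davinci model; the prompt wording and the definition of “relevant” are things you would tune:

```python
import openai  # legacy pre-v1 SDK

def ask(prompt):
    resp = openai.Completion.create(
        model="text-davinci-002",  # assumption: an instruct-style Davinci model
        prompt=prompt,
        max_tokens=5,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip().lower()

def propose_label(record_text):
    # Step 1: ask the model to label the record against your own definition of "relevant".
    prompt = (
        "An article is RELEVANT if <your definition here>; otherwise it is NOISE.\n\n"
        f"Article metadata:\n{record_text}\n\n"
        "Label (relevant or noise):"
    )
    return ask(prompt)

def verify_label(record_text, label):
    # Step 2: a second pass that challenges the first answer.
    prompt = (
        f"Article metadata:\n{record_text}\n\n"
        f"Proposed label: {label}\n"
        "Is this label correct? Answer yes or no:"
    )
    return ask(prompt).startswith("yes")
```

Records where the two passes disagree (or where GPT-3 disagrees with the human tagger) are the ones worth a manual re-check.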

Please report back on whether using a larger model made a difference in accuracy!

@daveshapautomator Yeah, I mean the labels are “relevant” vs “not-relevant-in-one-of-1000-ways”, so there is necessarily a lot of granularity hidden in the “noise” class.

I wasn’t aware that I could use GPT-3 to SUGGEST labels; is there a way to do that? Or is this a situation of “my team relabels using 2 or 3 different schemas and we test each one to see which performs better”?

@jhsmith12345 I did try Babbage; almost no increase in any of the metrics… accuracy went from 85% to 85.7%.

Just ask it. Use DAVINCI to synthesize higher quality data. I have quite a few videos with different techniques to synthesize data.

ooooh it’s you! lol… I used your recursive summarizer at work and blew everyone’s mind! Thank you, and keep doing what you do! I’ll check your videos to identify a label-proposer. And I suppose that after that step is done, we would then correlate the synthetic GPT-3 labels w/ my human ones?
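
(Thinking out loud, once the GPT-3 labels exist, the comparison step might look something like the sketch below; the file and column names are placeholders.)

```python
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

df = pd.read_csv("labels_compared.csv")   # placeholder: has "human_label" and "gpt3_label" columns

print("agreement:", accuracy_score(df["human_label"], df["gpt3_label"]))
print("kappa:", cohen_kappa_score(df["human_label"], df["gpt3_label"]))
print(confusion_matrix(df["human_label"], df["gpt3_label"], labels=["relevant", "noise"]))

# Disagreements are the cheap place to focus human review.
disagreements = df[df["human_label"] != df["gpt3_label"]]
print(len(disagreements), "records to re-check")
```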

Hi,

We’ve been working on tools at Humanloop to make fine-tuning and improving GPT-3 easier. Would love to chat and see if we can help. Could you please drop me an email at raza at humanloop dot com?

Thanks!

R

You’ll have to experiment with it. If you have any kind of feedback mechanism, you can just continually improve your dataset.
