I have about 3,000 rows of labeled metadata pertaining to news articles, and I'm trying to decide whether each one is relevant to my application or not. The metadata columns are: language, link, article title, publish date, a list of “factors” important to me, a list of “triggers” important to me, and the article source. The labels are ‘relevant’ or ‘noise’.
The factors and triggers span many categories; if I expanded them all out, it would add about 4,000 extra columns. To make this dataset work w/ GPT-3, I smushed all of these columns into one ‘prompt’ column, prefixing each value with its column name (in case GPT-3 can do something w/ that info). It looks like this:
LANG:es LINK:https://www.<redacted>.com TITLE:RD, Costa Rica <redacted> RISKFACTORS:['<redacted1>', '<redacted2>', ..., '<redactedn>'] TRIGGERS:<redacted1>,<redacted2>,...,<redactedn> SOURCE:<redacted>
Notice the inconsistent use of [], etc.
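For context, the prep step looks roughly like this (a simplified sketch with placeholder file and column names, not my exact code):

```python
import pandas as pd

# Placeholder file/column names; factors and triggers are already collapsed
# into single string fields here rather than the ~4000 expanded columns.
df = pd.read_csv("articles.csv")

def build_prompt(row):
    # Prefix each value with its column name, in case GPT-3 can make
    # use of that information.
    return (
        f"LANG:{row['language']} "
        f"LINK:{row['link']} "
        f"TITLE:{row['title']} "
        f"RISKFACTORS:{row['factors']} "
        f"TRIGGERS:{row['triggers']} "
        f"SOURCE:{row['source']}"
    )

df["prompt"] = df.apply(build_prompt, axis=1)
df["completion"] = " " + df["label"]  # ' relevant' or ' noise', with a leading space
df[["prompt", "completion"]].to_json("train.jsonl", orient="records", lines=True)
```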
I know 3,000 records is much more than what’s required, but I fed them all into an Ada fine-tune and got back an accuracy of 85%. I’d like that to be a bit higher, but I’m actually not sure about the labelling quality per se, so I’m not aiming for 100% either.
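For reference, I kicked off the fine-tune roughly like this, using the legacy (pre-1.0) openai-python library that the Ada/Babbage/Curie fine-tunes go through; parameter values are reconstructed from the fine-tuning guide, so they may be slightly off:

```python
import openai

# Upload the prepared JSONL files (I split off a small validation set first).
train_file = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = openai.File.create(file=open("valid.jsonl", "rb"), purpose="fine-tune")

# Start the fine-tune; the validation file is what enables classification metrics.
openai.FineTune.create(
    training_file=train_file.id,
    validation_file=valid_file.id,
    model="ada",
    compute_classification_metrics=True,
    classification_positive_class=" relevant",
)
```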
What are the next steps to try to improve this model? Would going up to Babbage or Curie improve the result? Should I suppress some of the columns, or perform some transformations, or is it a “take it or leave it” situation? It’s a bit weird giving up control like this.
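To make “suppress some of the columns” concrete: what I had in mind is rebuilding the prompt from subsets of the fields and re-running the fine-tune once per variant, something like the sketch below (same placeholder names as above):

```python
# Map the prompt field names to the underlying dataframe columns.
FIELDS = {
    "LANG": "language",
    "LINK": "link",
    "TITLE": "title",
    "RISKFACTORS": "factors",
    "TRIGGERS": "triggers",
    "SOURCE": "source",
}

def build_prompt_subset(row, keep):
    # Build the prompt from only the selected fields.
    return " ".join(f"{name}:{row[col]}" for name, col in FIELDS.items() if name in keep)

# One ablation per dropped field: fine-tune on each variant and compare
# validation accuracy to get a crude sense of which columns matter.
ablations = [
    [name for name in FIELDS if name != dropped]
    for dropped in FIELDS
]
```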
(Also, as a secondary question… I guess it is impossible to get any kind of feature importance, or to understand which tokens were considered to decide the label?)
Thanks for reading!