Classification API challenge

Hello to you reading this, and thank you in advance for your time/attention.
I’ve gone through 20 or so previous posts on this topic and couldn’t find a satisfactory answer, so I hope I’m not duplicating unnecessarily.

Context:
I want to identify marketing-relevant job titles (e.g. at 70-80% probability).

Input: I have a few million profiles, with the job title in the 1st column (e.g. senior colourist) and skills in the 2nd column (e.g. ‘post production’, ‘color grading’, ‘commercials’, ‘baselight’, ‘color correction’, ‘documentaries’, ‘film’, ‘broadcast television’, ‘online editing’, ‘digital cinema’).

I have another list of marketing job titles (e.g. Communications and Marketing Manager) and skills (e.g. copywriting, editing, event planning, social media, public speaking, web content creation, newsletters).

I tried inputting a small set initially: 2k examples with the marketing label and 2k with the non-marketing label.

Challenge:
I’m getting a high match rate on irrelevant entries, e.g. ‘office manager’ = 99% match to marketing and ‘hairdresser’ = 84%.

Initially I tried job title only, with two labels: marketing and non-marketing. This worked OK but there were some issues (a 90% match on ‘traffic warden’, presumably because of the job title ‘ad trafficker’, which was labeled as ‘marketing’).
Then I concatenated the job title into the skills (as only 1 input is allowed), so I have a much richer text input with the same 2 labels. This performs worse than the first version, which I’m assuming is because the tokens are split out and treated without the context of the string that makes up the job title.
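
A minimal sketch of how the title and skills could be combined into a single input, assuming (untested here) that labelled fields help keep the job title together as a phrase rather than blending it into the skills. `build_input` and the "Job title:" / "Skills:" prefixes are hypothetical and not part of any API.

```python
def build_input(job_title: str, skills: list[str]) -> str:
    """Combine a job title and its skills into one labelled text field.

    The field prefixes are an illustrative assumption, not a required format.
    """
    title = job_title.strip()
    # Drop empty skill entries and surrounding whitespace.
    cleaned = [s.strip() for s in skills if s.strip()]
    if not cleaned:
        return f"Job title: {title}"
    return f"Job title: {title}. Skills: {', '.join(cleaned)}"


example = build_input(
    "senior colourist",
    ["post production", "color grading", "baselight"],
)
print(example)
```

Whether this beats the title-only version would need checking against a held-out set of profiles with known labels.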

My question:
How the hell do I get this to work? :)


Hehe, if you’ve met as many marketers as I have you’ll find you are absolutely spot on with your first line!!

Ah, that’s interesting. I’m looking at embeddings for mapping out the landscape of skills (particularly capturing synonyms, misspellings, etc.) but hadn’t thought about leveraging them for this use case.

I have a call later with a data scientist (I’m a non-coder and non-scientist), as I couldn’t quite fathom how to use the 2048 results returned for each input. I will raise this with him.
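
As a rough sketch of one way those per-input numbers might be used, assuming they are an embedding vector for the text: average the vectors of known marketing examples into a centroid, do the same for non-marketing, then assign a new profile to whichever centroid it is closer to by cosine similarity. The 3-number vectors below are made-up placeholders standing in for the real 2048-number ones, and `centroid`, `cosine_similarity` and `classify` are illustrative names, not functions from any library.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder vectors: real ones would come from the embeddings endpoint.
marketing_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
non_marketing_vectors = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]

marketing_centroid = centroid(marketing_vectors)
non_marketing_centroid = centroid(non_marketing_vectors)


def classify(vector):
    """Return the nearer label plus both similarity scores."""
    m = cosine_similarity(vector, marketing_centroid)
    n = cosine_similarity(vector, non_marketing_centroid)
    label = "marketing" if m > n else "non-marketing"
    return label, m, n


label, m_score, n_score = classify([0.85, 0.15, 0.05])
print(label, round(m_score, 3), round(n_score, 3))
```

With real data, a trained classifier on top of the vectors (for example logistic regression) may well do better than centroids; the centroid version is shown only because it is the simplest to reason about.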

Thanks for taking the time to reply m-a.schenk, much appreciated.
