I was working on a classification task for an indiscriminate dataset of 150+ website body-texts labeled “relevant” and “junk”, and used OpenAI’s embedding API paired with a random forest classifier to get an average F1 score of 79%. I then ran each article through a simple GPT summarizer and got embeddings of the summary instead, which raised the F1 score to 85%. I then took it even further with a more restrictive summary prompt that forces every summary to start with “this article is about”, and got an F1 score of 87%. I think this works by “normalizing” the text fields so the embeddings can focus on the material differences between the texts rather than superficial attributes like length or vernacular.
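For anyone curious, here's roughly what the pipeline looks like. This is a minimal sketch, not my exact code: the `summarize` and `embed` helpers are hypothetical stand-ins (in the real version they call a GPT chat completion with the restrictive prompt and the OpenAI embeddings endpoint; here `embed` is a toy deterministic hash vector so the snippet runs without an API key).

```python
import hashlib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def summarize(text: str) -> str:
    # Hypothetical stand-in: the real version prompts a GPT model with
    # something like "Summarize this article, starting with
    # 'This article is about'". Here we just fake that structure.
    return "This article is about " + text[:200]

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: the real version calls the OpenAI
    # embeddings API. A hash-based vector keeps this sketch runnable
    # offline; it is NOT a meaningful embedding.
    digest = hashlib.sha256(text.encode()).digest()
    return np.frombuffer(digest, dtype=np.uint8).astype(float)

def train_classifier(texts: list[str], labels: list[int]) -> RandomForestClassifier:
    # Embed the summary of each article, not the raw body text.
    X = np.stack([embed(summarize(t)) for t in texts])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)
    return clf
```

The only real change from the baseline is that `embed` is applied to `summarize(text)` instead of `text`; everything downstream stays the same.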
Let me know if you’ve seen anything like this before! I found it very useful for my use case and haven’t seen this technique described elsewhere. I’d be happy to share my data.