Can we use OpenAI Embeddings (ada) for similarity / duplicate detection on articles?


Hello everyone!
I want to build a feature to find potential duplicate articles in our database. This should work similarly to the “Your topic is similar to” feature of this platform.

We have a database with 1000+ articles. Each article has a title and a body, and averages about 200 words. I want to find potential duplicates by measuring similarity. I believe Ada can do this, but I’m unsure whether it can handle the whole article when computing similarity. Will it give the right result? I can find examples that use only titles, but that is not enough for us. Are there examples of doing this successfully? Do you have any other thoughts you can share on how to do this well?

Welcome to the community!

That is a bit of a loaded question!

I would say yes, it can. However, while you can certainly load the whole article into Ada, in my experience Ada mostly looks at the beginning of the article: title, abstract, maybe part of the first paragraph if you’re lucky. All the other tokens are pretty much ignored.

While this sounds pretty bad, I would still mention that in most cases, that’s enough for use-cases similar to yours.

The text-embedding-3 models, on the other hand, have a better track record of looking at the entire text. However, they have a tendency to fixate on certain topics. If your articles are more complex and contain multiple topics, you might run into unexpected issues with the new embedding models.

In any case, summarizing each article first and then embedding the summary seems to work best, but it is considerably more expensive.

I would suggest you do a test run with plain old ada and see how it goes. After that, you can always upgrade to more sophisticated methods.
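For that kind of test run, the comparison step can be sketched roughly as below. This is a minimal sketch, not a definitive implementation: it assumes you have already obtained one embedding vector per article (for example from the embeddings API) and stored them in a dict keyed by article ID. The function names, the toy vectors, and the `0.9` threshold are all hypothetical; the threshold in particular would need tuning against articles you already know to be duplicates.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_similar_pairs(embeddings, threshold=0.9):
    """Return (id_a, id_b, score) for every pair at or above the threshold.

    `embeddings` maps an article ID to its embedding vector.
    The default threshold is an assumption, not a recommended value.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    # Most similar pairs first, so likely duplicates are reviewed first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
toy_embeddings = {
    "a1": [1.0, 0.0, 0.0],
    "a2": [0.9, 0.1, 0.0],
    "a3": [0.0, 1.0, 0.0],
}
similar = find_similar_pairs(toy_embeddings, threshold=0.9)
```

With around 1000 articles, the all-pairs loop is roughly half a million comparisons, which plain Python can handle; a vector index only becomes worth considering at much larger scale.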


If you are looking for duplicates or near-duplicates of articles, look at the fuzzywuzzy library in Python.

It’s basically string similarity (not semantic) but will run very fast and cheap for lots of data…
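To illustrate what non-semantic string similarity looks like, here is a minimal sketch that swaps in Python's standard-library `difflib.SequenceMatcher` in place of fuzzywuzzy, so it runs without extra dependencies. The helper names and sample strings are made up for illustration; fuzzywuzzy's own scoring functions report on a 0 to 100 scale rather than 0 to 1.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def string_similarity(a, b):
    """Character-level similarity in [0, 1]; 1.0 means identical after normalizing."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Hypothetical sample texts.
original = "How to reset your password in the admin panel"
near_copy = "How to reset your  password in the Admin panel!"
unrelated = "Quarterly sales figures for the northern region"

near_score = string_similarity(original, near_copy)
far_score = string_similarity(original, unrelated)
```

Because this compares characters rather than meaning, it catches copy-paste duplicates with small edits but would miss two articles that say the same thing in different words.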


Thank you for your response! So the best approach is to embed the title and, for example, the first paragraph. That saves tokens, and the remaining text might not be used at all anyway.

Our articles are focused on one topic most of the time, I think. It depends on what "topic" means in your explanation. In this post, for example, I also describe one issue I have within a single topic. Or do you mean something else?

Not sure. It’s been around a while, so I am sure something similar is out there.

I meant like if you try to embed a section of a novel that jumps between two narratives, you might run into trouble. Or you try to embed an article that deals with nocturnal habits of ant-eaters and then jumps to talking about the 2024 election.

I wouldn’t really worry about that; it’s a micro-optimization. Embedding costs next to nothing, so just throw in the whole thing before trying to fiddle around with separation.


We solved this with the use of OpenAI Embeddings, and it is performing well. Thanks for all the help!
