Can we use OpenAI Embeddings (ada) for Similarity / duplicates on articles?

s.kop · February 27, 2024, 8:34am

Hello everyone!
I want to build a feature to find potential duplicate articles in our database. It should work much like the “Your topic is similar to” feature on this platform.

We have a database with 1000+ articles. Each article has a title and a body, and averages about 200 words. I want to find potential duplicates using similarity. I believe Ada can do this, but I’m unsure whether it can handle the whole article when judging similarity. Will it give the right result? I can find examples that use only titles, but that is not enough for us. Are there examples of doing this successfully? Do you have any other thoughts on how to do this well?

Diet · February 27, 2024, 8:46am

Welcome to the community!

That is a bit of a loaded question!

I would say yes, yes it can. However, while you can certainly load the whole articles into Ada, Ada will only look at the beginning of the article. Title, abstract, maybe part of the first paragraph if you’re lucky. All the other tokens are pretty much ignored.

While this sounds pretty bad, I would still mention that in most cases, that’s enough for use-cases similar to yours.

The text-embedding-3 models, on the other hand, have a better track record of looking at the entire text. However, they have a tendency to fixate on certain topics: if your articles are more complex and contain multiple topics, you might run into unexpected issues with the new embedding models.

In any case, a summary => embedding seems to work best, but is considerably more expensive.

I would suggest you do a test run with plain old ada and see how it goes. After that, you can always upgrade to more sophisticated methods.
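
A minimal sketch of what such a test run could look like, assuming the official `openai` Python package (v1 client) with `OPENAI_API_KEY` set in the environment. The helper names and the 0.9 threshold are illustrative assumptions, not values from this thread; the threshold needs tuning on your own articles.

```python
import math
from itertools import combinations


def embed_texts(texts, model="text-embedding-ada-002"):
    """Return one embedding per input text (needs the `openai` package and an API key)."""
    from openai import OpenAI  # imported lazily so the helpers below run without it

    client = OpenAI()
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_similar_pairs(vectors, threshold=0.9):
    """Return (i, j, score) for every pair at or above the threshold, best first."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine_similarity(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical usage: embed title and body together, then compare all pairs.
# texts = [f"{a['title']}\n\n{a['body']}" for a in articles]
# candidates = find_similar_pairs(embed_texts(texts))
```

With around 1000 articles, the all-pairs loop is roughly half a million comparisons, which should still be manageable without a vector database.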

drfalken · February 27, 2024, 10:46am

If you are looking at duplication, or near-duplication, of articles, look at the fuzzywuzzy library in Python.

It’s basically string similarity (not semantic) but will run very fast and cheap for lots of data…
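
A rough sketch of the string-similarity idea. To keep it dependency-free it uses `difflib.SequenceMatcher` from the standard library as a stand-in rather than fuzzywuzzy itself; fuzzywuzzy's `fuzz.ratio` returns a comparable score on a 0-100 scale. The function names and the 0.85 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def string_similarity(a, b):
    """Character-level similarity in [0, 1]; not semantic."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def near_duplicates(articles, threshold=0.85):
    """Return (i, j, score) for each pair of texts at or above the threshold."""
    hits = []
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            score = string_similarity(articles[i], articles[j])
            if score >= threshold:
                hits.append((i, j, score))
    return hits
```

This catches copy-paste duplicates and light edits, but not two articles that say the same thing in different words; that is where embeddings come in.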

s.kop · February 27, 2024, 12:26pm

Thank you for your response! So the best approach is to embed the title plus, for example, the first paragraph? That would save tokens, and the rest of the text might not be used anyway.

Our articles are focused on one topic most of the time, I think, though it depends on what a topic means in your explanation. In this post, for example, I also describe an issue I have within one topic. Or do you mean something else?

drfalken · February 27, 2024, 5:19pm

Not sure - it’s been around a while, so I am sure something similar is out there.

Diet · February 27, 2024, 8:53pm

I meant like if you try to embed a section of a novel that jumps between two narratives, you might run into trouble. Or you try to embed an article that deals with nocturnal habits of ant-eaters and then jumps to talking about the 2024 election.

I wouldn’t really worry about that; it’s a micro-optimization. Embedding costs next to nothing, so just throw in the whole thing before trying to fiddle around with separation.

s.kop · March 25, 2024, 1:40pm

We solved this with the use of OpenAI Embeddings, and it is performing well. Thanks for all the help!
