I’m currently trying to do some topic modeling on articles, and I have many (40+) possible categories. I’m currently doing something similar to Recommendation_using_embeddings.ipynb: using the text-embedding-ada-002 model to create embeddings for the articles and the possible labels, and basically picking the label with the highest cosine similarity.
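For reference, the label-picking step I’m describing can be sketched like this. The embeddings below are made-up placeholder vectors (in practice they would come from the text-embedding-ada-002 endpoint), and `pick_label` is just a hypothetical helper name:

```python
import numpy as np

# Illustrative stand-in embeddings; real ones would come from the
# text-embedding-ada-002 API and have 1536 dimensions.
label_embeddings = {
    "sports":   np.array([0.9, 0.1, 0.0]),
    "politics": np.array([0.1, 0.9, 0.1]),
    "tech":     np.array([0.0, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_label(article_embedding, labels):
    # Pick the label whose embedding is most similar to the article's.
    return max(labels, key=lambda name: cosine_similarity(article_embedding, labels[name]))

article = np.array([0.05, 0.15, 0.9])  # stand-in for an embedded article
print(pick_label(article, label_embeddings))  # → tech
```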
I’ve noticed that similarity models (text-similarity-*-*) and text search models (text-search-*-*-*) also exist.
What is the difference between using embedding models and these other options? They seem to cover the same use case I’m handling with my current model. Are their use cases different?
Ada-002, in this case, is a good match for your semantic similarity requirements between articles and labels.
Similarity models are specifically designed to measure the similarity between two pieces of text. They are typically trained on a dataset of labelled text pairs, where each pair is annotated with a similarity score. They can be very accurate for specific tasks, but they are only applicable to the tasks they were trained on.
Search models are designed to find documents that are relevant to a given query. They are typically trained on a dataset of documents and can be used to rank documents by their relevance to a query. Text search models can be very efficient at finding relevant documents, but they may not be as accurate as similarity models for tasks that require a more fine-grained understanding of the text.
Similarity models are specifically designed for finding the similarity between two pieces of text
I am at a loss as to how this is different from using the ada-002 model to get embeddings and comparing similarities using cosine.
In any case, I suspect ada-002 is the newer model, so it’s probably best I stick with it?
Generally, the difference is in the training and in the way you access them. There are so many variations at the moment that it’s hard to define a “stereotype,” but similarity models tend to be encapsulated systems where you input two pieces of text and get a similarity score as output, while embedding models take an input query and return the top (K) best matches from a collection of texts you previously embedded into a database.
One is 1-to-1, the other is 1-to-many. Hope that makes sense.
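The 1-to-1 versus 1-to-many distinction can be sketched with plain cosine similarity over embedding vectors. The vectors and document names below are made-up placeholders, not real model output:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1-to-1: score a single pair of embedded texts.
a = np.array([1.0, 0.0, 0.5])
b = np.array([0.9, 0.1, 0.6])
pair_score = cosine(a, b)  # one number answering "how alike are these two?"

# 1-to-many: rank a previously embedded collection against one query
# and keep the top k matches (here k = 2).
corpus = {
    "doc1": np.array([0.8, 0.2, 0.1]),
    "doc2": np.array([0.1, 0.9, 0.3]),
    "doc3": np.array([0.9, 0.05, 0.55]),
}
query = np.array([1.0, 0.0, 0.5])
top_k = sorted(corpus, key=lambda d: cosine(query, corpus[d]), reverse=True)[:2]
print(top_k)  # → ['doc3', 'doc1']
```

In both cases the underlying math is the same cosine comparison; what differs is whether you score one pair or rank a whole collection.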
Any time you want to pull back the text that is most similar to your input text, an embedding model will perform very well. If you want to ask “how much is this string of text like this other string of text?”, then you should use a similarity model.