Sentence Classification solution

I have a product that is supposed to take text from different sources, sales calls, docs, webinars… and find sentences in that text which belong to specific categories.

For each category we can give example sentences. Things get more complicated when we want to allow users to give their own example sentences and create their own rules for categories.

We started using the GPT-4 API with OK-ish results. But the prompt keeps growing as we try to feed it more examples for each category, it’s slow (~15 s to analyze 300 words of text), it’s getting too expensive, and I’m not sure we even need an LLM for our case.

We are trying to move away from GPT and find a less expensive, faster, more reliable solution.

What options do we have? After talking with some people, we are considering the following:

  • spaCy
  • SetFit: few-shot learning (gets complex when people start adding their own rules)
  • LoRA: fine-tuning our own LLM (gets complex when people start adding their own rules)
  • LlamaIndex (with gpt-3.5-turbo): RAG fixes the problem of people having their own rules, but is it a good fit for this use case? Still dependent on GPT, slow, limited requests per second.

Thank you in advance for any opinion or idea on this one.

Why not use embeddings?

You put a chunk of text in, get a vector out.

Then you compare how close the vectors are: the closer they are, the more semantically similar the texts are.

It’s also way cheaper than LLMs, and there are many embedding models to choose from these days.
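
As a minimal sketch of that flow, assuming the OpenAI Python SDK (v1.x) and text-embedding-3-small (the category sentences here are made-up examples, not a tested prompt set):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    """Turn a chunk of text into a vector."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical category examples and an incoming sentence
pricing_example = embed("Our plan costs $40 per seat per month.")
objection_example = embed("I'm not sure this fits our budget right now.")
sentence = embed("That price is more than we planned to spend.")

print(cosine_similarity(sentence, pricing_example))
print(cosine_similarity(sentence, objection_example))  # likely higher -> same category
```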


OpenAI currently offers three embedding models.
Per 1M tokens, the prices are as follows:

  • text-embedding-3-small: $0.02
  • text-embedding-3-large: $0.13
  • text-embedding-ada-002: $0.10

A higher number of dimensions can embed more information, but be aware that as the dimensionality increases, so does the computational cost of searching the stored vector data.

Personally, I think that the well-known text-embedding-ada-002 is good enough.

Also, since the saved vector data cannot be converted back into the original text, you need a tool like Chroma or FAISS to store the vector data and the text data as pairs.
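
A rough sketch of that pairing with FAISS (the sentences and the embed() helper are illustrative assumptions): the index holds the vectors, and a plain Python list holds the matching texts, paired by position.

```python
import faiss  # pip install faiss-cpu
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Hypothetical sentences to index
texts = [
    "Refunds are processed within 5 business days.",
    "The demo is scheduled for Tuesday.",
    "Our pricing starts at $99 per month.",
]
vectors = embed(texts)
faiss.normalize_L2(vectors)                  # unit vectors: inner product == cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])  # 1536 dimensions for ada-002
index.add(vectors)                           # FAISS stores only the vectors...
# ...so we keep the texts ourselves, paired with the vectors by position.

query = embed(["How much does it cost?"])
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)
print(texts[ids[0][0]], scores[0][0])        # nearest stored text and its similarity
```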

To retrieve related text, convert the query text into vector data with the same embedding model and perform a semantic search using cosine similarity (or Euclidean distance).

The higher the similarity (or the smaller the distance), the closer the meanings.

It is important to note that the query text and the stored text are vectorized to the same number of dimensions.

In other words, whether you vectorize a short question or a long text, you get a vector of the same dimensionality as long as you use the same embedding model and variant (for the latest embedding models, the default dimensions are 1536 for text-embedding-3-small and 3072 for text-embedding-3-large; text-embedding-ada-002 has a fixed dimension of 1536).

For example, even if you embed just the word “cat”, it is converted to a vector with the dimensionality specific to that embedding model.
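
A quick way to see this, assuming the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()

for text in ["cat", "A much longer passage about cats, their habits, and their history."]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    print(len(resp.data[0].embedding))  # 1536 both times for text-embedding-ada-002
```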


My answer may be unreliable, so if there are any errors, I would appreciate it if someone could point them out.


Thank you for the detailed response.

Wondering how much data we would need per category to get good results with this approach? Right now we can add maybe hundreds of examples; in your experience, will that give good results?


For classification, the amount of data you need depends on how many categories you want to classify.

Using an embedding model instead of an LLM for classification can be very cost-effective, but it requires more samples because the knowledge contained in the LLM is not available.

In general, the more categories you have, the more samples you will need.
But even with hundreds of examples, this approach can give good results.

You can get by with relatively little data if there are clear differences between the categories, or if the pre-trained embedding model (the OpenAI embedding models, in this case) is highly relevant to the target domain.

Actual results are unknown without experimentation, but if you have hundreds of examples, I recommend trying an embedding model with that data first.

If you have enough examples that resemble the data you want to classify, the classification will work well.

Then, if the accuracy is not good enough, you may want to consider increasing the amount of data you have.
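
As one possible starting point, here is a sketch of a simple nearest-neighbor classifier over labeled example sentences. The categories, sentences, and embed() helper are illustrative assumptions, and with hundreds of examples per category you would raise k.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

# Hypothetical labeled examples: (sentence, category), ideally hundreds per category
examples = [
    ("It costs $99 per month.", "pricing"),
    ("We can't afford this right now.", "objection"),
    ("Can you send over the contract?", "next-steps"),
]
example_vecs = embed([s for s, _ in examples])
labels = [c for _, c in examples]

def classify(sentence: str, k: int = 1) -> str:
    """Majority vote among the k most similar labeled examples."""
    sims = example_vecs @ embed([sentence])[0]  # cosine similarities (unit vectors)
    top = np.argsort(sims)[::-1][:k]
    votes = [labels[i] for i in top]
    return max(set(votes), key=votes.count)

print(classify("That's beyond our budget."))    # -> "objection", most likely
```

In practice you would also add a similarity threshold so sentences that match no category stay unlabeled, and user-defined rules then become easy: a new category is just another set of labeled example sentences.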

Happy embedding!