Semantic Classification for Product Categorization: Approaches and Recommendations?

Hello, colleagues,

I apologize for any mistakes in translating to English. I’m seeking guidance and would be extremely grateful for any assistance you can provide.

To provide context: I am working on a system whose main objective is to categorize products sold in supermarkets. Currently, I only receive the barcode and the product description. Based on this data, I need to determine to which category the product belongs. Here’s an example:

Input:

{
"Code": "7896035700021",
"Description": "CAPSULA MELITTA INTENS.9 MARCATO C/10"
}

Output:

{
"Code": "7896035700021",
"Description": "CAPSULA MELITTA INTENS.9 MARCATO C/10",
"Category": "Coffee Capsules"
}

The challenge is that categorization is done manually, making it a strenuous task. So far, I’ve categorized approximately 2 million items. These are distributed across 520 distinct categories.

My plan is to develop a semantic classification model. I intend to represent product descriptions as vectors. Thus, for each new product, I would convert its description into a vector and determine its category using cosine similarity, comparing it with the already categorized products.

Given this approach, I would like to know: Is this a good strategy for this kind of challenge? Are there more efficient or recommended methods for this situation? As I’m relatively new to the Data Science field, I appreciate any insights or recommendations you can offer.

Thank you in advance.

1 Like

For now, leaving out what would work best, here’s a question: how many queries on the database are you expecting?

Vectorizing 2 million and 2 million more items guarantees a ton of AI calls before you have a working system.

Instead, an online system could augment (and store) on-demand with a more expensive call to a language AI:

Inspiring prompt:

Uncategorized product description “”“CAPSULA MELITTA INTENS.9 MARCATO C/10"”".

Output only the best choice from these 520 distinct categories: [“Coffee Capsules”, “Camping Equipment”, “Cleaning Products”…]

AI can then use its knowledge and inference to answer. AI could even Google UPCs to help answers.

The input you show may give poor results on semantic embeddings matching, especially without further AI decision-making but instead a direct technique like only algorithmic highest rank category from the top-k matches.

I wouldn’t do embedding including the barcodes, as it would increase matching on “strings of number ending in 9” or whatever else is also understandably similar.

2 Likes

Thank you for your response. I understand that initially, the demand will be high. However, including the categories in a single prompt is impractical, given that they total 5,000 tokens, amounting to 12,820 characters.

I tried using your prompt in the playground, but the results were below expectations: out of 10 tests conducted, only one was successful. Although I do receive the barcode, I do not plan to vectorize it. I intend to vectorize only the item description after sanitizing it and then store it.

I am aware that, once the implementation is complete, each new product will generate additional processing demand. My main uncertainty is whether the approach I’ve chosen is truly the best for my situation.

Hi!
What about creating the vector embeddings using a free method (like TF-IDF) and then running tests against it using your substantial existing database?
Nothing is better than facts, I suppose.

1 Like

Thank you for the response. I will research about this model you mentioned and conduct this test. It apparently scores the more “strong” words, so it might be an interesting approach for testing

A completely different path is to fine-tune a base AI model so that it doesn’t need a long prompt.

Consider 52000 inputs and outputs gives it a good idea the permitted categories, and you can reject and better prompt a return out of scope.


The starting point with a base model with no training (input bolded):

babbage-002:

This AI deduces and prints the best product category using a product description:
Description: CAPSULA MELITTA INTENS.9 MARCATO C/10
Classifier: CAPSULA MELITTA INTENS.9 MARCATO C/10\nOutput: CAPSULA MELITTA INTENS…

davinci-002:

This AI deduces and prints the best product category using a product description:
Description: CAPSULA MELITTA INTENS.9 MARCATO C/10
Classifier: CAPSULA MELITTA INTENS.9 MARCATO C/10\nProduct Category: Capsule Coffee Machines\nDescription: CAPS

gpt-3.5-turbo (with its base chat training):

system: This AI deduces and prints the best product category using a product description.
user: CAPSULA MELITTA INTENS.9 MARCATO C/10
assistant: Based on the product description “CAPSULA MELITTA INTENS.9 MARCATO C/10,” the AI deduces that the best product category for this item is coffee capsules.


Just the latter two that can be fine-tuned have some understanding. I wouldn’t know without a Googling myself, which could be injected input phrased properly.

1 Like

I had also considered fine-tuning, and your suggestion was very relevant. Especially since I had doubts about which model would be most efficient for structuring my items.
Considering 52,000 entries is still far from the ideal number for effective training, as there will be categories that might cause confusion, such as:

“Pineapple in Syrup Und” the correct category is “Fruit Compote”
“Diced Pineapple Und” the correct category is “Processed Fruits”
“Pineapple Candy Und” the correct category is “Fruits”

So, for it to fine-tune properly, I would need to train with many variations of descriptions, which might be more costly than implementing a solution based on vector distance. What do you think?

I can’t see how these descriptions could be made more useful without, as I posit, some lookup that improves the product description of all inputs.

Just the barcode search results page of Google can help a small-prompt AI improve internal description for either use (this forum is now #1 search result)

ChatGPT input (on gpt-3.5-turbo):

// Instruction Write quality English product description. // Search results: 2: GTIN/EAN: 7896035700021. 7896035700021 - CAFÉ EM CÁPSULA TORRADO E MOÍDO ESPRESSO MARCATO MELITTA CAIXA 50G 10 UNIDADES. Brand melitta.
3: Informar CEP. Mercearia Matinais Café Café em Cápsula Torrado e Moído Espresso Marcato Melitta 50g com 10 Unidades. 7896035700021_1_2_1200_72_RGB
4: Código Identificador (SKU), 7896035700021. Categoria, Café. Tipo, Café. Característica, Cápsula. Marca, Melitta. Ean, 7896035700021

Product Description:

Indulge in the rich and invigorating flavors of the Melitta Roasted and Ground Espresso Coffee Capsules. Elevate your coffee experience with the Espresso Marcato blend, thoughtfully crafted to awaken your senses. Each box contains 50g of finely roasted and ground coffee, meticulously packed into 10 convenient capsules.

Unlock the essence of coffee perfection with Melitta, a renowned brand known for its commitment to quality. Whether you’re a coffee connoisseur or simply seeking a delightful morning pick-me-up, these capsules offer a convenient and flavorful solution. The unmistakable aroma and depth of flavor will transport you to your favorite café in every sip.

Discover the joy of convenience without compromising on taste. Elevate your mornings, afternoon breaks, or evenings with the bold notes of Espresso Marcato. Each capsule is a symphony of carefully selected coffee beans, roasted to perfection and encapsulated to preserve their freshness.

Experience coffee indulgence like never before with Melitta’s Roasted and Ground Espresso Coffee Capsules. Unleash the extraordinary with every brew, and let the rich coffee notes dance on your palate. Make your coffee moments memorable with Melitta, a trusted name in coffee craftsmanship.

Key Features:

  • Premium Roasted and Ground Espresso Coffee Capsules
  • 50g of Exquisite Coffee in 10 Convenient Capsules
  • Crafted for Rich Aroma and Intense Flavor
  • Brand: Melitta
  • EAN/GTIN: 7896035700021

Awaken your senses and redefine your coffee journey with Melitta’s Espresso Marcato. Savor the moment, one capsule at a time.

1 Like