I’m trying to classify a prediction from a niche domain and extract some entities from it. First tests on classification and extraction in two different steps have inspired much confidence, but I’m looking for the best way to do it:
For example, let’s assume I have a number of generic predictions about drinks:
“Tee is always way too expensive”
“I think black gold costs more than 5 dollar in starbucks”
“Orance juice tastes ugly at mc donalds”
I want to categorize these sentences into a predefined set of drinks {tea,coffee,orange,apple,water,…}. Importantly, the categories cannot be extracted via entities, as they might be implicit; the example above is fictional.
And extract a set of optional entities, e.g. “restaurant”, “price”, “taste”.
Normalize these entities so “mc donalds”, “mcdonalds”, “mcci” all come back as “Mc Donalds”.
How would I best go about this? Would a pretrained model with:
{"prompt": "Orance juice tastes ugly at mc donalds ->", "completion": " Category: orange\nPrice:\nTaste: ugly\nRestaurant: Mc Donalds"}
work?
Or would I train 3 different models to do this in 3 steps?
And finally, how can I tell a pre-trained model to categorize within a finite set of categories (and not invent new ones)? Do I have to add the list of categories to the prompt all the time?
I need a real example.
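For what it’s worth, here is a minimal sketch of how the completion format from the question could be checked after the model returns it. It does not call any model. The category set and the alias table are hypothetical (taken from the fictional drinks example), and `parse_completion` and `validate` are made-up helper names. The idea is that whatever the model outputs, the finite category set and the normalization are enforced in plain code afterwards.

```python
# Hypothetical category set and alias table, for illustration only.
ALLOWED_CATEGORIES = {"tea", "coffee", "orange", "apple", "water"}
RESTAURANT_ALIASES = {
    "mc donalds": "Mc Donalds",
    "mcdonalds": "Mc Donalds",
    "mcci": "Mc Donalds",
}


def parse_completion(text):
    """Parse 'Key: value' lines into a dict with lowercase keys."""
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # Drop surrounding straight or curly quotes around the value.
            fields[key.strip().lower()] = value.strip().strip('"“”')
    return fields


def validate(fields):
    """Reject invented categories and normalize the restaurant name."""
    category = fields.get("category", "").lower()
    if category not in ALLOWED_CATEGORIES:
        category = None  # flag for review instead of accepting a new label
    raw = fields.get("restaurant", "")
    restaurant = RESTAURANT_ALIASES.get(raw.lower(), raw) or None
    return {
        "category": category,
        "price": fields.get("price") or None,
        "taste": fields.get("taste") or None,
        "restaurant": restaurant,
    }


completion = " Category: orange\nPrice:\nTaste: ugly\nRestaurant: mc donalds"
result = validate(parse_completion(completion))
print(result)
```

A category outside the set comes back as `None`, so those rows can be retried or reviewed by hand rather than silently accepted.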
Short answer:
Fine-tuning barely works unless you have around 10k good, near-perfect examples AND you are trying to do something far simpler that does not go against the “opinion” of the training corpus. For example, you said “tea is always too expensive”; if this is not the view of the corpus, you will have maximal trouble making the model remember it effectively.
I have been working on a very similar task for a few days.
My project is to extract categories & tags from Google business reviews for each business in a list AND summarize the customer sentiment/likes/dislikes for the business.
The sole reason my client wanted to do all aspects of the task with one prompt was cost.
There are 260,000 reviews, so it is obviously more economical for him if all extraction/summarization can be done in one pass through the data.
My tests so far show that it is possible to complete the task with a zero-shot prompt. HOWEVER, the quality of the output is superior if a separate custom prompt is used for each part of the task (2X AI cost).
It’s a trade-off: if you want lower cost, you will also need to settle for lower quality, because if you ask the prompt to do several things at the same time, the quality will suffer (in my experience).
Thanks a lot for sharing, this is much appreciated.
Do you mind sharing the prompt you use for the categorization? Do you just input something like “[review] ->” or something like “The following is a google business review and its category:\n[review] ->”?
Fitnesscentrecomparison.com lists the applicable categories, facilities, equipment, and classes for every fitness centre in Australia. Visitors to the website can view the centre’s categories, facilities, equipment, and classes.
The following google reviews are for Anytime Fitness gym, 251A Morphett St, Adelaide SA 5000, Australia, and mention all of the categories, facilities, equipment, and classes, for this specific fitness centre. Website visitors can read the profile page and decide if they want to become a member. The name of the expert is
Fitness Expert
Hi Fitness Expert, please read through the following fitness centre reviews to determine ALL of the categories, facilities, equipment, and classes for this location, so we can help prospective customers decide if they want to become a member.
Reviews:
1
2
3
4
5
6 etc etc
End of reviews. Here are all of the categories applicable and the specific facilities, equipment, and classes mentioned in the reviews:
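In case it helps, a prompt like the one above could be assembled in code roughly as follows. This is only a sketch: the template is abbreviated from the full prompt, `build_prompt` is a made-up helper name, and the two sample reviews are placeholders, not real data.

```python
def build_prompt(business, reviews, expert="Fitness Expert"):
    """Fill an abbreviated version of the prompt with numbered reviews."""
    numbered = "\n".join(
        f"{i}. {review}" for i, review in enumerate(reviews, start=1)
    )
    return (
        f"The following google reviews are for {business}.\n"
        f"Hi {expert}, please read through the following fitness centre "
        "reviews to determine ALL of the categories, facilities, equipment, "
        "and classes for this location.\n"
        "Reviews:\n"
        f"{numbered}\n"
        "End of reviews. Here are all of the categories applicable and the "
        "specific facilities, equipment, and classes mentioned in the reviews:"
    )


# Placeholder reviews, invented purely to show the layout.
sample_reviews = ["Great pool and sauna.", "Plenty of squat racks."]
prompt = build_prompt(
    "Anytime Fitness gym, 251A Morphett St, Adelaide SA 5000, Australia",
    sample_reviews,
)
print(prompt)
```

The prompt ends on the open “Here are all of the categories…” line so that the model continues straight into the list.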
Yes, you can process the reviews from Excel, but it is much easier to use Google Sheets with a Google Sheets add-on (which performs API calls to OpenAI/Anthropic/Mistral/Groq etc.) to perform the processing.
Then you can have a results column which contains your required data.
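If you would rather script it than use a sheet add-on, the same “results column” idea can be sketched in Python, assuming the reviews have been exported to CSV. `call_model` here is a hypothetical stand-in that returns a fixed string; you would swap in a real API call yourself. The column names `review` and `result` are also assumptions.

```python
import csv
import io


def call_model(prompt):
    """Hypothetical stand-in for a real API call; returns a fixed string."""
    return "category: placeholder"


def add_results_column(csv_text, text_column="review", result_column="result"):
    """Return the CSV text with one extra column holding the model output."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames + [result_column],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        row[result_column] = call_model(row[text_column])
        writer.writerow(row)
    return out.getvalue()


# Placeholder rows, invented purely to show the layout.
sample = "id,review\n1,Great pool and sauna\n2,Friendly staff\n"
output = add_results_column(sample)
print(output)
```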