Say there is a hierarchy of categories: red, orange, blue, violet.
You first train on warm vs. cold colors, i.e. red/orange vs. blue/violet.
Then once you have that result, you break it down further: train one model to distinguish red vs. orange, and another model to distinguish blue vs. violet.
So you have 3 models total: one to decide warm vs. cold, and 2 follower models to break things down further depending on which branch the first model took.
Each of these models makes a binary choice and requires less data per choice … but there is no net win, since you still have to build the other models that make the additional choices.
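Here is a minimal sketch of that routing. The three classifiers are hypothetical stand-ins (in practice each would be a fine-tune or an embedding-based model), stubbed with trivial keyword rules just to show how the tree composes:

```python
def warm_vs_cold(text: str) -> str:
    # Model 1: picks the branch of the tree.
    return "warm" if any(w in text for w in ("red", "orange")) else "cold"

def red_vs_orange(text: str) -> str:
    # Follower model for the warm branch.
    return "red" if "red" in text else "orange"

def blue_vs_violet(text: str) -> str:
    # Follower model for the cold branch.
    return "blue" if "blue" in text else "violet"

def classify(text: str) -> str:
    # Route through the first model, then hand off to a follower.
    if warm_vs_cold(text) == "warm":
        return red_vs_orange(text)
    return blue_vs_violet(text)

print(classify("a reddish sunset"))  # -> "red"
```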
In the end, I feel it’s all the same, and you need lots of data to distinguish hundreds of categories.
The binary-tree approach is very organized, but you need more models. You can use a fine-tune or embeddings for each decision.
Fine-tunes tend to be black boxes, while embeddings sit somewhere between opaque and transparent.
In the end, it's lots of work either way because of the large number of categories involved.
Or you can shift work to cost by bootstrapping with a multi-shot model to create your labeling/training data.
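A minimal sketch of that bootstrapping step, assuming a hypothetical `call_llm` wrapper around whatever completion API you use:

```python
# Multi-shot prompt: a few worked examples, then the item to label.
FEW_SHOT = """Label each color description as red, orange, blue, or violet.

Description: the shade of a ripe tomato
Label: red

Description: a clear summer sky
Label: blue

Description: {text}
Label:"""

def call_llm(prompt: str) -> str:
    # Hypothetical: swap in your provider's completion client here.
    raise NotImplementedError

def bootstrap_labels(texts: list[str]) -> list[tuple[str, str]]:
    # Each (text, label) pair becomes cheap training data for the
    # downstream classifiers, at the cost of one LLM call per item.
    return [(t, call_llm(FEW_SHOT.format(text=t)).strip()) for t in texts]
```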
I have been experimenting more with embeddings as classifiers myself.
Classifiers built on embeddings can have any topology too: flat, like clusters (which is what I do), or binary trees (I haven't tried that one with embeddings).
So based on the method, I tend to gravitate towards a topology: with a fine-tune I would use a binary tree, and with an embedding approach I would use a flat correlation approach.
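For the flat approach, here is a minimal sketch assuming you already have embeddings from some model and a few labeled examples per category; classification is nearest centroid by cosine similarity:

```python
import numpy as np

def centroids(examples: dict[str, list[np.ndarray]]) -> dict[str, np.ndarray]:
    # One cluster center per category: the mean of its example embeddings.
    return {label: np.mean(vecs, axis=0) for label, vecs in examples.items()}

def classify(vec: np.ndarray, centers: dict[str, np.ndarray]) -> str:
    # Flat correlation: compare against every center, take the best match.
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centers, key=lambda label: cos(vec, centers[label]))
```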