Passing instructions only once for a message classifier use case

Building a message classifier with the GPT-4 API. Currently using a prompt with detailed category/label definitions to classify each message.

However, the problem is that the instructions (the definitions, in this case) amount to ~6k tokens and must accompany every user message, which raises the cost. Even with the Assistants API, this problem persists. I tried batch processing but observed performance degradation. I can't even consider fine-tuning, as there isn't much labelled data available.

I'm aware that there is no memory option as such for storing the instructions when using the API. Is there any better approach I can try for cost optimization?

Heya! Welcome to the forum.

Have you tried a multi-shot prompt with GPT-3.5-turbo or GPT-3.5-turbo-instruct? I'd try the latter first. Maybe start with 10-shot and work your way down. You shouldn't need too many examples.
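
If it helps, here's a minimal sketch of what that could look like, assuming the current Python SDK; the categories and examples are made up:

```python
from openai import OpenAI

client = OpenAI()

# Categories and examples below are made up; substitute your own.
FEW_SHOT = """Classify the message into one of: billing, shipping, refund, other.

Message: My card was charged twice this month.
Category: billing

Message: The package still hasn't arrived.
Category: shipping

Message: {message}
Category:"""

def classify(message: str) -> str:
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=FEW_SHOT.format(message=message),
        max_tokens=5,
        temperature=0,
    )
    return resp.choices[0].text.strip()
```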

If you share the prompt, we might be able to help simplify/improve that as well.

There’s a great Prompting category here too.

With a small amount of data, look into creating a fine-tuned model.

With a large amount of data, I would use correlation and labeled embeddings.

Thanks for your quick response.

Tried gpt-3.5-turbo-instruct earlier, but as the number of categories increased I hit the token limit and switched to GPT-4. The prompt contains short definitions of about 250 different categories. Few-shot would take even more tokens, since there are so many (~250) categories.

Yeah, if there is no other alternative, I'll have to work on getting some labelled data and using a fine-tuned model.

Hrm. Maybe split those 250 into smaller groups of similar categories, use a cheaper/faster model to figure out which group the user request falls into, and then classify within that group to get the finer-detailed category. That way you're not sending the entire 250 each time, since you likely don't need ALL of them.

Datasets are gold these days… especially clean/edited ones…

Currently there is a hierarchy for these categories, from primary (high level) to tertiary (much more detailed ones). As you mentioned, I will try the batching approach.

Thank you!!

No problem.

Whether it works or not, please come back and let us know.

We’ve got a great dev community growing here, and we’d love to have you part of the group.

@curt.kennedy’s got a great breakdown too.

Good luck and happy coding!

ETA: There's a 16k-context gpt-3.5-turbo, I believe (not instruct, sadly!), but if you can recode that bit for ChatML, that's another option…

250 categories is challenging. You need a lot of data in any case to distinguish each category for either the fine-tune or embedding approaches.

The data will either have to come from a human or a highly reliable model.

But data is everything.

But like @PaulBellow mentions, you can try a traditional and proven divide and conquer binary tree approach, if that makes sense for your categories.

This works because forking each branch is only 1 bit and requires less data to train, but you end up cascading more models to form the entire tree. So it's a trade-off.

Yeah, I would keep it to 1 or 2 hops, maybe even with Ada? Or is GPT-3.5-turbo faster now? The most I'm using in production is one low-cost call, then the real call… Even on tier 5 billing, my long 2k+ GPT-4 prompts take a long time at peak hours, so I stick to 3.5 for as much as I can.

For the record, I’m on team #gpt4-instruct, though I’m not sure we’ll ever see that model as it’s considered “legacy” at the moment.

Still a lot of improvements to the architecture overall, though, especially on the sometimes easier-to-understand Assistants side. The cheaper Batch API is cool too, and I want to put that to use soon.

So many improvements! :wink:

Welcome to the OpenAI community @likhitha

You could use embeddings to cluster the labels.

Once that's done, find the closest cluster to your data to be classified.

Then run classification only for that cluster with gpt-4 or turbo.
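
A rough sketch of that flow, assuming the OpenAI Python SDK and scikit-learn; the embedding model, cluster count, and definitions are just placeholder choices:

```python
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Placeholder: you'd load all 250 label -> definition pairs here.
definitions = {
    "billing": "charges, invoices, refunds on the account",
    "shipping": "delivery status, lost or delayed packages",
}

labels = list(definitions)
label_vecs = embed([definitions[lbl] for lbl in labels])

# Cluster the label definitions once, up front.
kmeans = KMeans(n_clusters=min(25, len(labels)), n_init=10).fit(label_vecs)

def candidate_labels(message: str) -> list[str]:
    # Assign the message to its nearest cluster of label definitions.
    cluster = kmeans.predict(embed([message]))[0]
    return [lbl for lbl, c in zip(labels, kmeans.labels_) if c == cluster]

# Then prompt gpt-4 (or turbo) with only those candidates' definitions.
```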

Since there is a hierarchy of categories, say red, orange, blue, violet.

You train on warm vs. cold colors, so red/orange vs. blue/violet.

Then once you get this result, you break it down further: train on distinguishing red vs. orange with one model, and distinguish blue vs. violet with another model.

So you have 3 models total. One to break down warm vs. cold, and 2 follower models to break down further depending on where the first branch occurred.

Each of these models is a binary choice, and requires less data to make each choice … but there is no winning since you have to create the other models that make the additional choices. :rofl:
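
As a sketch, the tree is just nested binary decisions; `classify_binary` below is a hypothetical stand-in for whatever binary model you train at each node:

```python
# Each node is (model_id, yes_branch, no_branch); leaves are final labels.
TREE = ("warm-vs-cold",
        ("red-vs-orange", "red", "orange"),
        ("blue-vs-violet", "blue", "violet"))

def classify_binary(model_id: str, text: str) -> bool:
    """Hypothetical stand-in: call whichever binary classifier you trained
    for this node (a fine-tune, an embedding threshold, etc.)."""
    raise NotImplementedError

def classify(text: str, node=TREE) -> str:
    if isinstance(node, str):   # leaf: we've reached a final label
        return node
    model_id, yes, no = node
    return classify(text, yes if classify_binary(model_id, text) else no)
```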

In the end, I feel it’s all the same, and you need lots of data to distinguish hundreds of categories.

The binary model is very organized, but you need more models. You can use a fine tune or embeddings for each decision.

Fine-tunes tend to be black boxes, and embeddings are opaque to transparent boxes.

In the end, it's lots of work either way because of the large number of categories involved.

Or you can shift work to cost by bootstrapping with a multi-shot model to create your labeling/training data.
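
The bootstrap loop itself is simple. Here's a sketch assuming the chat completions endpoint and a hypothetical labeling prompt; you'd want to spot-check the outputs by hand before fine-tuning on them:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical multi-shot prompt that already classifies reasonably well.
LABELING_PROMPT = "Classify this message into one category: ...\n\nMessage: {msg}\nCategory:"

def bootstrap_labels(messages: list[str], out_path: str = "train.jsonl") -> None:
    with open(out_path, "w") as f:
        for msg in messages:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": LABELING_PROMPT.format(msg=msg)}],
                temperature=0,
            )
            label = resp.choices[0].message.content.strip()
            # One fine-tuning example per line, in the chat JSONL format.
            f.write(json.dumps({"messages": [
                {"role": "user", "content": msg},
                {"role": "assistant", "content": label},
            ]}) + "\n")
```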

I have been experimenting more with embeddings as classifiers myself.

Classifiers using embeddings can have any topology too, so flat, like clusters (which is what I do) or binary trees (haven’t tried this one with embeddings).

So based on the method, I tend to gravitate towards a topology.

My preference is that with a fine-tune, I would use a binary tree, and with an embedding approach, I would use a flat correlation approach.

Yeah, tried using the gpt-3.5 16k one (not an instruct model), but the performance is neither good nor consistent.

This is actually multi-label classification. Below are the two things I'm currently experimenting with:

- Batch processing: combine 10 (or more) messages with all 250 category definitions and get the output in one call. If the performance holds up, the cost is reduced roughly 10-fold (see the sketch after this list).
- Cascading: in the first hit, get the primary categories (L1); in the second hit, send only the corresponding secondary category definitions and get the final label. This reduces the token count and thereby the cost. Combining batch processing with the cascading approach is quite complicated, I guess.
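
Here's the batch-processing sketch mentioned above. The model choice, prompt wording, and output parsing are all assumptions, and the parsing in particular is brittle:

```python
from openai import OpenAI

client = OpenAI()

CATEGORY_DEFINITIONS = "..."  # the full ~6k-token block of 250 definitions

def classify_batch(messages: list[str]) -> list[str]:
    numbered = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(messages))
    prompt = (
        f"{CATEGORY_DEFINITIONS}\n\n"
        "Classify each numbered message below. "
        "Answer with one line per message, formatted as '<number>. <label>'.\n\n"
        f"{numbered}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Brittle parsing: a malformed reply will break this, so validate in practice.
    lines = resp.choices[0].message.content.strip().splitlines()
    return [line.split(". ", 1)[1] for line in lines]
```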

Also, for some sets of messages where the patterns are quite clear, I'm using BERT-based sentence embeddings to get the classification, and using the LLM only for the messages that are unpredictable and long.
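
The embedding side of that routing can be as simple as this sketch (the model name, labels, and threshold are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder BERT-based model

labels = ["billing", "shipping"]                  # placeholders for your labels
label_texts = ["charges and invoices", "delivery status and lost packages"]
label_vecs = model.encode(label_texts, normalize_embeddings=True)

def classify_or_defer(message: str, threshold: float = 0.6) -> str | None:
    vec = model.encode([message], normalize_embeddings=True)[0]
    sims = label_vecs @ vec     # cosine similarity, since vectors are normalized
    best = int(np.argmax(sims))
    # Below the confidence threshold, return None and let the LLM handle it.
    return labels[best] if sims[best] >= threshold else None
```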

In my experience, multi-label classification using embeddings takes orders of magnitude more data for the system to converge to a low error rate.

I haven't tried an extensive prompt that then generates labels from a large (250-label) set to draw from, like you have, either.

So if you do have a hierarchy, I would first get your high level classification, and then break it down further from there.

This divide and conquer approach should give you more accurate labels, no matter what approach you pick, because your initial filtering is eliminating all the noise from all the wrong labels being applied.

Another thing to consider, besides embedding vectors, is using something like TF-IDF to get keywords that then map to labels. You can actually do both keywords and embeddings at the same time and combine them with RRF (reciprocal rank fusion) to give an overall best set of labels for each chunk of text.

So in the implementation I'm thinking of here, you correlate on semantics (embeddings / dense vectors) and on keywords (sparse vectors); the passages are the same chunks, just viewed through different lenses, and the two rankings are then combined with RRF. The weightings don't have to be equal either, so there's lots of room to tune something like this.

You then harvest the top labels, with even higher weight going to keywords, or key phrases, that align highly with specific labels.
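
For anyone wanting to try the fused approach, here's a bare-bones sketch: a TF-IDF ranker plus a weighted RRF combiner. The weights are assumptions, and the embedding ranker would be along the lines of the earlier snippets in this thread:

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_tfidf(message: str, label_texts: list[str]) -> list[int]:
    """Rank label indices by keyword (sparse vector) similarity to the message."""
    vec = TfidfVectorizer().fit(label_texts + [message])
    sims = cosine_similarity(vec.transform([message]), vec.transform(label_texts))[0]
    return sorted(range(len(label_texts)), key=lambda i: -sims[i])

def rrf_fuse(rankings: dict[str, list[int]],
             weights: dict[str, float], k: int = 60) -> list[int]:
    """Weighted reciprocal rank fusion: score = sum(weight / (k + rank))."""
    scores: dict[int, float] = defaultdict(float)
    for name, ranked in rankings.items():
        for rank, label_idx in enumerate(ranked, start=1):
            scores[label_idx] += weights[name] / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. rrf_fuse({"tfidf": tfidf_ranked, "embedding": emb_ranked},
#               {"tfidf": 1.5, "embedding": 1.0})   # unequal weights, tune to taste
```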