This is my first post on the forum, and my use case is a bit unique.
Context:
I am an intern and need to set up a ticket categorization/automatic-response system for my company's SaaS support tickets using AI (here OpenAI). These tickets contain very specific information, for example images and CSV files.
For now, I am focusing solely on the “categorization” part. Here’s what’s happening:
1. Ticket creation detection
2. Extraction of the images (these are URLs, not interpretable in the prompt) and creation of a summary for each image using OpenAI, which we integrate into the prompt (sketched below)
3. Automatic categorization
4. Review by a support engineer: they classify the ticket and, in a dedicated field, can explain to the model why its automatic categorization is incorrect.
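For reference, step 2 looks roughly like this (a simplified sketch using the openai Python SDK; the model name and prompt wording are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def summarize_image(image_url: str) -> str:
    """Turn an image URL from the ticket into a short text summary for the prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this support-ticket screenshot in 2-3 sentences."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```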
So, I have implemented a whole pipeline to create a dataset in the following format: [ticket_id/ticket_title/ticket_content/ticket_category/model_prediction/model_prediction_feedback]
The downside is that I need to deploy this system quickly without going through a web app (the joy of corporate restrictions…), so I can’t implement a dedicated RAG for this use case.
With fine-tuning, I can build a dataset that weights the quality of the answers: for example, if ticket_category ≠ model_prediction, I incorporate model_prediction_feedback into the prompt to improve the model. However, this is quite costly and the results are not always great (I currently have only 400 entries, 12 categories, and a fairly complex classification task), and it also seems that fine-tuning is not a good option for this kind of use case…
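To give an idea, here is a simplified sketch of how I turn the dataset rows into chat-format fine-tuning examples (field names follow the format above; the prompt wording is illustrative):

```python
import json

def to_finetune_example(row: dict) -> str:
    """Build one JSONL line of the chat fine-tuning format from a dataset row."""
    user_content = f"Title: {row['ticket_title']}\n\nContent: {row['ticket_content']}"
    # When the model was wrong, surface the engineer's feedback so the corrected
    # example carries the reasoning, not just the right label.
    if row["model_prediction"] != row["ticket_category"] and row.get("model_prediction_feedback"):
        user_content += (
            f"\n\nPrevious (incorrect) prediction: {row['model_prediction']}"
            f"\nEngineer feedback: {row['model_prediction_feedback']}"
        )
    example = {
        "messages": [
            {"role": "system", "content": "Classify the support ticket into one of the 12 categories."},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": row["ticket_category"]},
        ]
    }
    return json.dumps(example, ensure_ascii=False)
```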
The vector store might be a good alternative… But I'm not a fan of the black-box effect where we don't know exactly what form the retrieved information takes (unlike a locally implemented RAG). I'd like to know how I could build a kind of "categorization samples dataset" in the vector store that the assistant can actually make sense of.
I have the data, I just don't know how to format it in the most efficient way for my use case. I've spent hours on the internet trying to find relevant resources, but found nothing about this "no-code RAG" approach.
Also, what do you think would be best: one file per example, or all the examples in a single file?
I’m not sure if this is helpful, but one thing I’ve leveraged is something like Marvin (askmarvin.ai - I can’t post links) to do Classification on an input.
You can feed it images, text, both, whatever. And then you can provide it “instructions” that can include typical prompt techniques like few-shot examples. I’ve found that the LLM is pretty good if you provide it with a good enough prompt/instruction AND descriptive classification enums (think about something like: “support_ticket” vs “login_fail_support_ticket”).
I’ve also leveraged Marvin’s “extraction” - which can be done similarly with Structured Outputs - where the model you pass is like a classification request.
```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

# Placeholder labels; replace with your real ticket categories
SomeEnum = Enum("SomeEnum", {"EXAMPLE1": "Example1", "EXAMPLE2": "Example2", "WHATEVER": "Whatever"}, type=str)

class ClassificationExample(BaseModel):
    classification: SomeEnum = Field(..., description="The classification of the example")
    reason: Optional[str] = None
```
You can do this with a number of frameworks/tools like Instructor, Outlines, native calls (with good prompts), whatever. I just mention Marvin because it's extremely quick to use and I'm most familiar with it.
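For reference, the "native calls" route with the openai SDK's structured-outputs helper looks roughly like this (model name and ticket text are placeholders; the exact helper and strict-schema handling depend on your SDK version):

```python
from openai import OpenAI

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the support ticket into one of the allowed categories."},
        {"role": "user", "content": "Title: Cannot log in\n\nBody: SSO redirect keeps looping."},
    ],
    response_format=ClassificationExample,  # the Pydantic model defined above
)
print(completion.choices[0].message.parsed)  # -> a ClassificationExample instance
```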
Hi! Thank you for your answer. Sadly, your solution doesn't fit my problem…
I forgot to explain that the ticket-analysis trigger runs through Zapier, so I need to integrate the solution directly in a Zap. I want to keep using OpenAI because it is really easy to integrate with Zapier. That's why I call this a "no-code RAG" approach.
If I show great results with this Zapier solution, then I'd have the possibility to start building a real Python web app for this use case.
I haven't used Zapier in ages. Can you pass JSON to the OpenAI call? Theoretically, you could compile a Pydantic model to a schema (or build a schema yourself) - like you see here - and pass that to the OpenAI call.
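Something like this sketch, assuming Pydantic v2 and the ClassificationExample model from earlier (note that OpenAI's strict mode additionally wants additionalProperties: false and every property marked required, so the raw schema may need small tweaks):

```python
import json

# Compile the Pydantic model to a JSON schema and wrap it in the
# response_format shape expected by the chat completions API.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket_classification",
        "schema": ClassificationExample.model_json_schema(),
    },
}
print(json.dumps(response_format, indent=2))  # paste this into the call
```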
Even without this, I would experiment with classification via a prompt like you see here. There are a lot of resources on prompt engineering that you could leverage!
My issue is not about the output format (Pydantic etc.), but about the accuracy/precision of the categorization.
My model always gives me a correctly formatted answer (the category name).
What I want is to improve the quality of the categorization, and apart from few-shot examples I don't know what else I can do.
My first thought was to fine-tune the model with a lot of examples, but apparently fine-tuning is not a good fit for classification…
I then wanted to add a lot of examples to the vector store, but it didn't really improve the quality of the classification…
Few-shot examples are a good approach, but due to the high variability of the inputs, the model starts struggling with more sophisticated tickets…
Hi! Can you elaborate why you think fine-tuning would not work here? Normally, classification is a great fit for fine-tuning. I have used it extensively myself.
Embeddings-based classification can also work extremely well depending on the nature of your classifications.
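If it helps, the core idea is just nearest-neighbour search over embedded examples; a minimal sketch (model name and example tickets are placeholders):

```python
from collections import Counter

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Return unit-normalised embedding vectors for a list of texts."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Labelled examples from your dataset: (ticket text, category)
examples = [
    ("Cannot log in via SSO, redirect keeps looping", "authentication"),
    ("The CSV export comes back empty", "data_export"),
]
example_vectors = embed([text for text, _ in examples])
labels = [label for _, label in examples]

def classify(ticket_text: str, k: int = 5) -> str:
    """Assign the majority label among the k most similar labelled examples."""
    query = embed([ticket_text])[0]
    similarities = example_vectors @ query  # cosine similarity (unit vectors)
    top_k = np.argsort(similarities)[::-1][:k]
    return Counter(labels[i] for i in top_k).most_common(1)[0][0]
```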
From everything I’ve read, it seems that OpenAI’s fine-tuning is primarily intended for formatting output rather than specifying tasks.
While I've achieved some interesting results (though not quite good enough), that reading made me wonder whether my success was merely down to "luck" in my particular use case.
Could I have been mistaken about this? Can you briefly explain what your classification use case was? Thanks a lot.
Additionally, I have created a way for support engineers to leave feedback when the model makes a miscategorization, so I was really pleased to see that fine-tuning lets us provide "weighted" answers (a kind of RLHF).
Do you think I was on the right track and just started overthinking all of this?
First of all, if you have not had a chance, this official guide provides a good overview of fine-tuning including when fine-tuning is suitable.
One of my core use cases involves classifying news items into different categories. I have multiple fine-tuned models for different types of classifiers: some handle single-label classification tasks, another handles multi-label classification. While I have optimized my approach over time, in general I can say that the fine-tuned models have done a very reliable job at classification.
For the single-label tasks there are up to 15 labels to choose from. For the multi-label task, at one point I even had the model select from 100 labels, and even in that case I got it to work well.
For your information: I used anywhere between 1,000 and 3,000+ manually labelled training examples to create my models (more for the multi-label task). You can likely get decent results with fewer examples, but this is just to give you a reference point.
I’d just like to know a bit more about your system prompt in the dataset.
Do you give the categories and their descriptions? Do you use few-shot prompting? Do you even use a system prompt?
I do use either a system or a user prompt - in practice it does not matter much which one you use. As part of the prompt I only supply the category labels without any further description; through the fine-tuning, the model has picked up the differences well. I don't have it handy now, but I can share a disguised version of my prompt later. It's relatively succinct and straightforward, but it has worked well for me.
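In the meantime, the general shape is roughly this (the category labels here are made up):

```python
# Illustrative only: a succinct classification prompt of the kind described.
# The real prompt would list your actual ticket categories.
SYSTEM_PROMPT = """You are a support-ticket classifier.
Assign the ticket to exactly one of the following categories:
billing, authentication, data_export, performance, bug_report, feature_request.
Respond with the category label only."""
```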