Fine-tuning for use of keyword lists

I am fine-tuning a model to assign keywords to various documents. The catch is that the keywords must always be chosen from a set list of about 300 possible keywords, for a few specific reasons that are irrelevant to this post. This obviously makes for a long system prompt/user prompt, depending on how the query is structured. My question is whether I can expect the model to “learn” the keywords through the fine-tuning process, such that I would no longer need to include them in the system message or user prompt in deployment. If so, am I right in thinking I would need somewhere around 10-20 examples of each keyword in the training set to get it to do this? Or would the model start to “remember” the list if it just appears 1000+ times in the system prompt in the training data?

You have a question that I touched on just yesterday in the forum; you can see whether I’ve offered enough opinion there for you to go off of.

Yes, that makes sense. I have a fine-tuned model that is significantly more reliable on this task than GPT-4, but you have to give it the keyword list on each API call or it “goes off script”. Not sure if the function-calling or lookup approach would actually result in token savings in the long run… but a good idea.

If I understand your use-case correctly you are feeding GPT a document and you want it to classify it using specific keywords?

If so, yes, fine-tuning is a great option. Fine-tuning for classification purposes is a very common use-case. You “teach” it the behavior of classification, and the keywords get learned as a byproduct.

Another option is to use a vector database and then compare the documents to your list of keywords along with their description. I’d recommend considering and at least trying this out first.

An example in the legacy fine-tuning guide is actually quite similar to what I believe you are looking for:

Case study: Categorization for Email triage

Let’s say you’d like to categorize incoming email into one of a large number of predefined categories. For classification into a large number of categories, we recommend you convert those categories into numbers, which will work well up to ~500 categories. We’ve observed that adding a space before the number sometimes slightly helps the performance, due to tokenization. You may want to structure your training data as follows:

{
    "prompt": "Subject: <email_subject>\nFrom:<customer_name>\nDate:<date>\nContent:<email_body>\n\n###\n\n",
    "completion": " <numerical_category>"
}

For example:

{
    "prompt": "Subject: Update my address\nFrom:Joe Doe\nTo:support@ourcompany.com\nDate:2021-06-03\nContent:Hi,\nI would like to update my billing address to match my delivery address.\n\nPlease let me know once done.\n\nThanks,\nJoe\n\n###\n\n",
    "completion": " 4"
}

In the example above we used an incoming email capped at 2043 tokens as input. (This allows for a 4 token separator and a one token completion, summing up to 2048.) As a separator we used \n\n###\n\n and we removed any occurrence of ### within the email.

Note: The separators are no longer needed. I also believe the whitespace isn’t either?

A good rule of thumb with fine-tuning is if you are sending a massive amount of static tokens each prompt you can usually use fine-tuning to “bake them in”. Just keep in mind that it’s hard to “unbake it”. So if your keywords are constantly changing you may want to try something else.
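
To make “baking them in” concrete: the keyword vocabulary appears only in the training examples, so the deployed system message can stay short. A minimal sketch of one training line in the current chat fine-tuning JSONL format (the instruction wording and keyword values here are made up for illustration):

```python
import json

# Hypothetical training line in the chat fine-tuning JSONL format.
# The keyword list lives only in the training examples, so the deployed
# system message can stay short. The keyword values shown are made up.
example = {
    "messages": [
        {"role": "system", "content": "Tag the document with keywords from the approved list."},
        {"role": "user", "content": "<document text>"},
        {"role": "assistant", "content": "trade; settlement; fur"},
    ]
}
line = json.dumps(example)  # one such line per example in the .jsonl file
```

At inference time you would send only the short system message plus the document, relying on the fine-tuned weights to constrain the output vocabulary.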

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won’t need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.

I think there are probably better ways than fine-tuning to get what you want.

The first thing I would recommend would be to put these ~300 keywords into a hierarchical structure because 300 distinct keywords is way too many for an LLM (or a human being even) to be able to reliably tag.

I think your best results would come from having only 4–5 divisions at each level; then you could iteratively interrogate the model for ever more fine-grained classifications. Note: You might be able to do this as a type of chain-of-thought prompt.
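
The iterative drill-down could be sketched roughly like this (the taxonomy and the `choose()` callback, which stands in for one LLM call per level, are hypothetical):

```python
# Sketch of iterative hierarchical classification. TAXONOMY and choose()
# (a stand-in for one LLM call per level) are made up for illustration.
TAXONOMY = {
    "science": {"physics": {}, "biology": {}},
    "business": {"finance": {}, "marketing": {}},
}

def classify(document, taxonomy, choose):
    """Walk the hierarchy, asking the model to pick one branch per level."""
    path = []
    level = taxonomy
    while level:
        options = sorted(level)           # ideally only 4-5 divisions each
        pick = choose(document, options)  # e.g. one chat-completion call
        path.append(pick)
        level = level[pick]
    return path
```

Each call only ever presents a handful of options, which sidesteps the 300-item list entirely; the trade-off is several round trips per document.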

Another approach might be to give the model free rein to pick keywords, then, for each of the generated keywords, find the most semantically similar keyword from your allowed list and, as a final check, query the model as to the appropriateness of each retrieved keyword for the original document.

I just think the number of examples you would need in your training set to get the model to reliably generate mixed-membership classifications for ~300 classes is likely on the order of tens to hundreds of thousands of document-keyword pairs.


Although in theory I like the thought of an iterative process and can agree that it could have better results than one-shotting a massive list, I can’t see why fine-tuning wouldn’t be a more suitable option if the results are at the very least similar: fewer tokens, much less latency. CoT can also be accomplished in fine-tuning, and I think it would be beneficial as well.

Another issue I am seeing with this solution is if there are multiple keywords from multiple categories. As soon as I start to see a web of logic I fall-back to fine-tuning.

I’m not saying you’re wrong btw. I think this is entering the realm of nuance. Just curious

Yeah, I’m aware of that documentation. What I would point out though is that this isn’t strictly a classification problem—documents can be and almost certainly should be associated with a varying number of different keywords.

This has the possibility to make the problem vastly more complex for the model as there might be complicated interplay between the keywords.

That’s why, even with this documentation in mind, I believe ~300 keywords is too many for a model to be able to reliably select from.

There’s also the question as to identifying how many keywords a particular document should be tagged with.

Some of this would be infinitely easier with an old-fashioned neural network classifier because we could just look at the values in the output layer and use some heuristics to choose a cutoff point. That’s just not possible with an LLM.

The more I think about it, the more I’m convinced the best approach would be to lean into embeddings.

I see a few ways this could be done.

  1. What I’ve already described: ask the model to create its own list of suggested keywords, g, for a document, then compute the pair-wise similarities of each generated keyword g_{i} with the static list of acceptable keywords k. With some minimum threshold, choose from k a subset k^\ast based on the similarities to the elements of g. Then sanity-check it by submitting k^\ast back to the model as a proposed list of keywords and asking it to choose the best options from that set.
  2. Alternately, with enough document-keywords pairs, one could simply create embeddings of the documents (or their most important sections if they’re too large) then create an embedding vector for each document which needs keywords associated with it, find those documents with known keywords which are most semantically similar, aggregate those keywords as candidates and submit them to the model to choose from/rank/whatever.
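
Option 1 might be sketched like this (`embed()` stands in for a real embedding call, and the 0.8 threshold is an arbitrary placeholder; the final model sanity-check is omitted):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def snap_keywords(generated, allowed, embed, threshold=0.8):
    """Map each model-generated keyword to its closest allowed keyword.

    embed() stands in for a real embedding call; generated keywords whose
    best match scores below `threshold` are dropped. 0.8 is a placeholder.
    """
    allowed_vecs = {k: embed(k) for k in allowed}
    chosen = set()
    for g in generated:
        gv = embed(g)
        best, score = max(
            ((k, cosine(gv, v)) for k, v in allowed_vecs.items()),
            key=lambda kv: kv[1],
        )
        if score >= threshold:
            chosen.add(best)
    return sorted(chosen)
```

The returned subset (k^\ast in the notation above) would then go back to the model for the final appropriateness check.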

There is a possibility I’m wrong, though, and the model will do this well; I just wouldn’t bet on it. There are known biases for the models to pay greater attention to (and thus preferentially select among) the first few and last few options in large lists, as well as biases towards a moderate number (3–5) of choices from large lists.

For instance, a model would almost certainly never respond that zero of the keywords are appropriate, nor is it likely to suggest 20 of them are good fits—even if those were the correct answers.

Anyway, I would be very excited to see someone (not me) produce some empirical results for this type of task. I also think it’s a great test-case for an eval/benchmark.


Separators are needed unless they arise organically.

They are the actual “prompt” for the AI, for it to recognize where it is to begin writing an answer and not writing more of the user’s email.

All really interesting points. For context, I managed to get a fine-tuned model working at about 88-92% of the human expert level on this task. I trained it using chain-of-thought training examples, breaking the documents into sections and assigning keywords to each section. My evals were 100 manually coded documents. Comparing the keywords from the model to the human’s, we consistently get 88-92% accuracy (meaning that 8-12 human-assigned keywords are missing from the model’s keywords). Often the model adds additional keywords from the list, but this is less of an issue. FT was done on about 2 million tokens. My main concern was reducing cost while maintaining speed going forward, as passing 2k context every time with 1000s of queries gets expensive. Interestingly, base GPT-4 scored 78-84% and GPT Turbo 78-86%, but I had to change the prompt around to get that.

Ok, yeah, I do like this. Give the latest model full rein and then magnetize the generated keywords to the pre-approved list & validate. It takes full advantage of a model like GPT-4 and also an embedding model, and doesn’t lock the keywords in.

Although I would still argue (for the sake of arguing) that the same result may be possible with fine-tuning. Interplay, and the subjective meaning of which “keywords” to extract, to me seems to fall straight into the lap of fine-tuning. If my eyes go cross-eyed trying to logic it out, I just fine-tune it.

I’m not too sure about this. I have fine-tuned Ada models that are very capable of refusing tasks when nothing is sufficient. Such as when a product name (one of hundreds not learned by the base model) isn’t in its vocabulary, or when the query isn’t suitable for the response (irrelevant). Being brutally honest, it does seem like a daunting task to accomplish though.

Lost-in-the-middle. There was a cool write-up here of someone providing a decent solution to this issue. I’ll try and find it. The general gist was intuition rocks… Repeat the content in the middle lol

Me too. This would be very cool to try out! Places it on my shelf of 1,000 other things.

I think if OpenAI hosted competitions they would get a lot of traction and powerful insights.

Nice results!

Do you mind describing your use-case a little more?

Yeah. I think it would be better to have more rather than fewer, as you can just compare and knock off any items that are missing.

Although it would be interesting to understand why it wanted to add those keywords so badly.

@mark_humphries

If your list stays the same, I’d embed each keyword to get a vector; then, when there is text to classify, I’d embed the text and compare its vector to all 300 keyword vectors to sort the keywords by relevance to the text (cosine similarity, for example). Then I’d take x number of keywords from the top of the sorted list.
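
That ranking step could be sketched like this (the toy vectors stand in for real embedding output):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_keywords(text_vec, keyword_vecs, x=3):
    """Sort all keywords by cosine similarity to the text and keep the top x."""
    ranked = sorted(keyword_vecs, key=lambda k: cosine(text_vec, keyword_vecs[k]), reverse=True)
    return ranked[:x]
```

With only 300 keyword vectors, brute-force comparison is cheap enough that no vector database is strictly needed; the embeddings for the keyword list can be computed once and cached.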

You might also check this post here on a different approach to a similar problem: How I cluster/segment my text after embeddings process for easy understanding? - #9 by sergeliatko

I tried this actually, with both a couple of transformer models and Ada, but the results were not useful. There were two issues. First, the data being coded is pretty domain-specific, which meant out-of-the-box embedding models were not the best choice. Second, left to its own devices, GPT-4 can assign some pretty random keywords sometimes and, even worse, seems to default towards vague, simplistic terms which are not very helpful (and also don’t map well). I think this would work very well on a fairly generic task and less so on coding domain-specific knowledge.

Sure! It’s keyword coding historical documents of various lengths in a database, for use with a search function as part of a RAG pipeline. Yes, I am using semantic search too, but it’s not accurate enough for this use case, as the queries are typically open-ended. HyDE works somewhat OK, but not perfectly either.

The main problem is that any form of semantic search misses at least some of the relevant documents, and that matters in this use case, as we need to be able to guarantee that our analytical model considers all relevant documents, not just some. So we use a combined keyword/semantic search approach.

By using a finite list of keywords, I’ve found that this maximizes retrieval efficiency, as the model analyzing the user query chooses from the same list as the model assigning the keywords to the documents. Semantic search then catches other documents that might be relevant in terms of meaning but don’t directly match. The alternative would have been FTing or training a customized embedding model, but even then I am not sure it would have been better.
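
The combined keyword/semantic retrieval described here might look something like the sketch below (all names are illustrative; `doc_keywords` maps a document id to its assigned keyword set, `doc_vecs` to its embedding):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(query_keywords, query_vec, doc_vecs, doc_keywords, top_k=2):
    """Union of exact keyword hits and top-k semantic hits.

    The keyword pass guarantees recall for anything tagged with a matching
    keyword; the semantic pass catches related-but-untagged documents.
    """
    keyword_hits = {d for d, kws in doc_keywords.items() if kws & set(query_keywords)}
    semantic_hits = set(
        sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)[:top_k]
    )
    return keyword_hits | semantic_hits
```

Because the keyword pass is an exact set intersection over a finite vocabulary, it can never silently drop a tagged document, which is the guarantee the semantic pass alone can’t provide.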


@mark_humphries without actual data samples cannot tell more, sorry.

I’ve done the same! Yes it makes sense (I think) to assign additional keywords to each document that help contrast it to others.

Bring the implicitness from hiding!

I’m out right now, but random thought: maybe use a FT model as a “connector”? :thinking: So if you trained it to reduce the documents to keywords, could you not flip it and use it to also perform query expansion by applying the same keywords?

There’s a cool model where you actually prefix the input with a task label for easy separation.

In my experience it’s almost necessary to perform work on a user query before running it against the vector database. I actually used a metaphone with my keywords to catch user misspellings, but I wonder if it would’ve made more sense to just have a simple fine-tuned model.
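
For what it’s worth, a simpler alternative to a phonetic index (metaphone needs a third-party package) is stdlib fuzzy matching; a sketch with difflib, using a hypothetical keyword list:

```python
from difflib import get_close_matches

KEYWORDS = ["finance", "settlement", "fur trade"]  # hypothetical list

def correct_keyword(term, keywords=KEYWORDS, cutoff=0.6):
    """Snap a possibly misspelled user term to the closest known keyword.

    Returns None when nothing clears the similarity cutoff, so unmatched
    terms can fall through to semantic search instead.
    """
    matches = get_close_matches(term.lower(), keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

This catches edit-distance-style typos but not sound-alike misspellings (“fonetik” for “phonetic”), which is where metaphone or a small fine-tuned rewriter would still earn its keep.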