Accuracy of few-shot learning vs. fine-tuning for tens of examples

Context: I’m wondering about classification problems with tens of training examples, say something like sentiment analysis of tweets, but for different, more challenging problems.

I understand that few-shot learning, where a number of examples are given as part of the prompt, works quite differently from fine-tuning the model. I am wondering how best to deal with cases where one has a relatively large number of examples for prompt-based few-shot learning but relatively few for fine-tuning, say 20 to 100 examples.

My sense is that for a small number of examples, the few-shot learning approach is significantly more effective than fine-tuning with the same examples. Is there a way to make fine-tuning classification accuracy approach that of few-shot learning with limited data?

Is doing both useful? I.e. doing fine-tuning with 30 examples and then using 10 of those, or 10 others, in the prompt during inference?
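For readers unfamiliar with the prompt-based approach described in the question, here is a minimal sketch of how labelled examples might be packed into a few-shot prompt. The "Text:" / "Class:" template, the function name and the toy examples are assumptions for illustration, not something taken from this thread.

```python
def build_few_shot_prompt(examples, query):
    """Format labelled examples plus one unlabelled query as a single prompt.

    The "Text:" / "Class:" layout is an assumed template, not a required format.
    """
    blocks = [f"Text: {text}\nClass: {label}" for text, label in examples]
    # The final block is left open so the model completes the class.
    blocks.append(f"Text: {query}\nClass:")
    return "\n\n".join(blocks)


# Hypothetical toy examples (sentiment on a 1 to 5 scale).
demo_examples = [("great service", 5), ("never again", 1)]
prompt = build_few_shot_prompt(demo_examples, "it was fine")
```

The same function would work unchanged whether the prompt is sent to a base model or to a fine-tuned one, which is what "doing both" would amount to in practice.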


I think fine-tuning tends to work better even at 20 (or more) examples. It can also be worth testing with fewer, as you can probably use a smaller model for similar accuracy. One caveat is that fine-tuning can be unstable, so picking a good set of balanced examples and potentially trying a couple of training runs will help (though few-shot prompting is probably also somewhat unstable, depending on which examples are used).
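One possible reading of "a good set of balanced examples" is to draw the same number of examples from every class, and to vary the random seed between training runs. The sketch below assumes that reading; the function name and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(examples, per_class, seed=0):
    """Pick `per_class` examples from every label, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    chosen = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"label {label!r} has only {len(pool)} examples")
        chosen.extend(rng.sample(pool, per_class))

    # Shuffle so the classes are not grouped together in the training file.
    rng.shuffle(chosen)
    return chosen


# Hypothetical toy data: 30 examples spread evenly over classes 1 to 5.
examples = [(f"text {i}", i % 5 + 1) for i in range(30)]
subset = balanced_sample(examples, per_class=2, seed=1)
```

Calling it with a different `seed` for each run gives a different but still balanced subset, which makes it easier to see how much of the variation in accuracy comes from the choice of examples.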

I also don’t think there’s much benefit to doing both together, but you can try!

Thanks for your thoughts, Michael. Appreciated.

What kind of labels are you trying to get?

  1. Five classes, simply numbered 1 to 5. Unfortunately, my client isn’t comfortable with me publicly sharing details of the application, but I can say it’s a nontrivial psychometrics problem where the classes are highly abstract without obvious vocabulary associated with them.

  2. I’m also interested in a related problem, which isn’t classification but a sort of strong summarisation: having a prompt summarised to a large but semantically restricted set of keywords. (Not detection of keywords.)

This is not a good use of GPT-3, nor is it a good approach to this problem. Use real words in your training set, as GPT-3 understands those words, and I guarantee it understands psychology better than any single psychologist. You can programmatically translate the real words into numbers later. Anyway, this is maybe a CURIE-level problem, but I still think GPT-3 is overkill: why not use an SVM with a bag-of-words technique for such a simple problem? GPT-3 is capable of speculating on the motivations and internal reasoning for why people do what they do.

(note: this is from a real tweet)

This is also a trivial task which I wrote about in my book.
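The suggestion above, to train on word labels and translate them into numbers afterwards, could be sketched as follows. The five label words are placeholders invented for this example (the real classes were not shared in the thread), and the clean-up rules are an assumption about what a completion might look like.

```python
# Placeholder label words; the actual class names are not known from the thread.
LABEL_TO_NUMBER = {
    "very low": 1,
    "low": 2,
    "moderate": 3,
    "high": 4,
    "very high": 5,
}


def completion_to_class(completion):
    """Map a model completion such as ' High.' to its class number.

    Returns None when the completion is not one of the known label words.
    """
    cleaned = completion.strip().lower().rstrip(".")
    return LABEL_TO_NUMBER.get(cleaned)
```

Returning `None` for unrecognised completions keeps off-vocabulary outputs visible instead of silently forcing them into a class.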

Thanks for the response, David. I appreciate your time.

  1. I’ve considered using word labels for the classes and will give that a try. Given the complexity of the associations I’m after, the terms aren’t all that descriptive though. I doubted it would be semantically significant to GPT-3, but I could embellish the labels to add some semantic richness.

I’ve come at the problem in several ways, with and without GPT-3. Curie’s performance approached chance. Davinci was better, but still way below human accuracy. The particular application is not one of the standard psychometric instruments and is tough even for skilled, highly trained humans.

The tweet example is interesting and I’ve done similar things for this problem. However, while my results looked good at first glance, comparing them to human experts’ ratings revealed a lack of real traction on the problem by GPT-3. I’d be interested to know whether you have inter-rater reliability scores or some other measure of accuracy for GPT-3 compared to humans on assessing tweets in the way you did.

  2. I’ve been impressed by GPT-3’s ability to summarise text and extract keywords in early tests. I’m after highly abstract characterisations of the content, though, so “keyword” is maybe the wrong term. Either way, I guess I just need to experiment more.
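On the inter-rater reliability question raised in this post: one common measure for two raters assigning categorical labels is Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version with made-up ratings; treating the model as one "rater" and a human expert as the other is an assumption about how the comparison would be set up.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement, from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used a single, identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up ratings on a 1 to 5 scale, for illustration only.
human = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
model = [1, 2, 3, 4, 5, 1, 2, 3, 5, 4]
kappa = cohens_kappa(human, model)  # about 0.75 for these toy ratings
```

A value near 0 means agreement no better than chance, which is a stricter test than raw accuracy when some classes are much more frequent than others.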

If this is the case, then you do not have the categories or problem clearly identified. Once you clarify the goal, categories, and method, DAVINCI will outperform anything that a human can do, especially on classification tasks.

I sympathise @rikus because I’m also using GPT-3 in the context of a highly specialized and technical field where accuracy is crucial. GPT-3 is excellent at language generation, which can make it very deceptive and appear much smarter than it really is. Language generation is not the same as truth generation! (As an aside, sometimes I wish there was a sub-forum for GPT-3 users working on objective-/science-/fact-/truth-/accuracy-driven use cases. These use cases often deal with non-everyday language and subsets of the world’s information for which GPT-3’s training is both amazing and limited.)

For my project, I have found embeddings to be my best friend. If you have five classes and only 100-1,000 pieces of data, by the time you create a fine-tuned model, you might have done as much work as you would have with human labelling, and with poorer results. Perhaps you could try embedding your data, then clustering it. The clusters might provide you with clues as to how you could either (1) assign a textual vocabulary to your classes, which you can also get embeddings for, or (2) maybe even figure out a way to assign numerical embeddings to your classes. I’m out of my depth now, but perhaps an “average” embedding for each cluster, with standard deviations, could be generated as the basis for your classification step.

Good luck! I’m always interested in use cases that bear some similarity to my own.

Best, Leslie
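The "average embedding per class" idea above can be sketched as a nearest-centroid classifier: average the embeddings of each class, then assign a new item to the class whose average is most similar. This is only an illustration under assumptions: the three-dimensional vectors are toy stand-ins for real embeddings, cosine similarity is one possible choice of measure, and the standard deviations mentioned in the post are not used here.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fit_centroids(embeddings, labels):
    """Compute one average embedding (centroid) per class label."""
    grouped = {}
    for vector, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vector)
    return {label: mean_vector(vectors) for label, vectors in grouped.items()}


def predict(centroids, vector):
    """Return the label whose centroid is most similar to `vector`."""
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


# Toy 3-dimensional "embeddings"; real ones would come from an embedding model.
train_vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
train_labels = [1, 1, 2, 2]
centroids = fit_centroids(train_vectors, train_labels)
prediction = predict(centroids, [0.8, 0.3, 0.0])
```

With only tens of examples per class, a method like this has very few parameters to fit, which is part of why it can be a useful baseline against fine-tuning.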


Thanks Leslie, I appreciate your thoughts.

I started out with embeddings. My current implementation that uses sentence embeddings performs roughly on par with the best performance I’ve managed to coax out of GPT-3 so far. I’m using the SentenceTransformers Python package for that.

Clustering in the high-dimensional embedding space is tricky. I tried the excellent UMAP dimension reduction algorithm to project the embedding vectors down to two or three dimensions. That produced some (semi-supervised) clusters that looked promising, but turned out not to generalise at all.

More experiments, I guess. Good luck with your project too!