Five rules for finetuning from my experience, observations, and consulting

I have been consulting with dozens of teams around the world on GPT-3 since before finetuning came out. Since then, I have done several finetuning experiments and found some limitations of finetuning - things it’s not yet good at, or that require very careful planning with your datasets. Here are my observations about people and teams who are interested in finetuning, and some advice I have.

Rule 1: Do not get into finetuning if you are not already good with GPT-3

There was one case of a grad student on Reddit who was asking me about finetuning for a thesis project. After some conversation, it quickly became apparent that he wasn’t even familiar with GPT-3, how to use it, or what it can do. I told him that he needed to start with the basics and learn the base tool first. I have seen several other teams assume that they’d need finetuning to perform their tasks, but with just a little bit of testing, I showed that they just need to start with basic prompts.

Obviously, I cannot read minds, but here’s what I think is going on: Many people are not taking the time to understand how to use GPT-3 or how powerful it is. They tend to assume that it’s just a basic NLP model that is pretty stupid. I remember having an argument on a Discord channel about how intelligent GPT-3 is and I commented that it was far more intelligent than the guy I was arguing with (this did not go over well, for obvious reasons). But that was a turning point for me, when I realized that GPT-3 contains more knowledge than any single human, and has a more robust language model (ability to use language to perform tasks) than any single human. For instance, GPT-3 can read and write legal documents, medical texts, philosophy, and high energy physics. Not only that, but the more recent DAVINCI models can do it far faster than humans can even think, let alone read and write.

Start with the assumption that GPT-3 is smarter than you. Just ask it to do the task you want, and it will surprise you.

This next statement may come across as offensive or controversial: If you don’t believe that GPT-3 is smarter than you, then you haven’t done your homework. I am sorry I have to be so direct, but I am tired of repeating this message. People want to argue over whether or not GPT-3 is “True AGI” yet, but that doesn’t really matter IMHO. The fact of the matter is that it can predict text at super-human levels in practically any domain.

If you don’t grasp this, and how to write good prompts to get the best use out of GPT-3, then you are not ready for finetuning.

Rule 2: Start with prompt engineering before getting into finetuning

Let’s assume that you’re already impressed by GPT-3 and you think it could be a proto-AGI. You know this thing is smarter than you and so you’re ready to let the rubber meet the road. Even so, you should not get into finetuning until you’ve exhausted your ability to engineer better prompts. This is doubly true with the latest DAVINCI INSTRUCT models. The correct instruction makes all the difference in the world, and this is why I recommend having a writer (or reader, or some other liberal arts person) on your team.

Computer scientists, for all their strengths, simply are not trained to think qualitatively. In order to get the most out of your GPT-3 prompts, you need to think with a very rich vocabulary. After all, it is a language model, so if you’re not good with language, you’re not going to be good with GPT-3. I have several examples of showing GPT-3 to different kinds of people, and those who get it the fastest and most intuitively are people like writers, librarians, and people from the “soft sciences” like psychologists. You know - the kinds of people that some computer scientists like to criticize and mock. This technology means that we have to set aside our elitism and prejudice against other disciplines and come together.

Here’s a prime example of what I mean: if you ask GPT-3 to “translate this into simpler language” it will do its best to perform that task. However, there’s a much better, more concise word to use: summarize. The significance of the verb translate versus the verb summarize is going to be incredibly obvious to any writer, philosopher, or psychologist. To the computer scientist, though, the two tasks might seem identical. After all, if you summarize something, you are simply “translating it into simpler language”. But that’s not exactly how GPT-3 works and thinks, as it’s going to follow your word choice very precisely.

Rule 3: Find the limits of normal GPT-3 before you think of finetuning

When I was developing NLCA (my cognitive architecture), I stopped researching when I found the natural limit of GPT-3. I discovered that there were tasks that GPT-3 simply could not do: they were too arcane, too specific, or too complex. I was satisfied that I’d taken the technology as far as it could go, and my architecture was becoming increasingly fragile. As a side note, this is the exact reason I haven’t released the full code for my original NLCA instance - it worked sometimes, but because of the randomness and complexity, it often blew up spectacularly.

However, some of these problems were solved by finetuning, which is why I’ve been slowly working on finetuning datasets to make my cognitive architecture more stable. But again, I only came to this after months and months of research and testing with normal GPT-3 (both INSTRUCT and vanilla/original models). Because of those months of work, I gained a very strong intuition about what GPT-3 can and cannot do well. Unless you already have this good mental model of GPT-3, you’re probably not ready for finetuning.

I don’t want to gatekeep or try to keep people out of finetuning. I’m not saying “I’m smarter than you so you should just buzz off!” What I’m trying to say is that there is a level of mastery required with normal GPT-3 before you understand its strengths and weaknesses, and you need that understanding to make the most of both ordinary prompts and finetuning.

There’s another good reason to avoid finetuning: creativity. With ordinary prompts, you can easily maintain a lot of creativity within GPT-3. However, once you finetune a model, you’ll notice its creativity drops drastically, and it will become much more rigid in its thinking. This leads to my next point.

Rule 4: Finetuning is best for consistency and specific tasks

Let’s say you have done an experiment and you’re looking for very specific kinds of responses, with very specific formats or other criteria. You’ve wrangled and struggled with ordinary prompts but you still just can’t get the consistency you want. Maybe you have a customer service chatbot that is too sensitive to customer frustration, and the chatbot becomes abusive, or starts talking about things it’s not supposed to. Or let’s say you have a complex set of requirements and no matter what you do, GPT-3 just keeps going off the rails and making up its own mind.

You’ve tried zero shot and few shot learning and gotten help on prompts and it’s still not being consistent enough. Now it might be time to consider finetuning. Remember, though, that consistency comes at the cost of creativity and flexibility. If I ever figure out how to maintain full creativity (which is basically a form of confabulation) while still staying on task, I will let everyone know. But so far, finetuning seems to clamp down on imagination and creativity (aka confabulation) which does carry the benefit of consistency, but it comes at the cost of making GPT-3 more like a broken record.

Now, even if you KNOW for a fact that your task requires consistency, don’t jump the gun and assume that finetuning is the answer. Right out of the box, without any special treatment, GPT-3 can generate very consistent results with the right prompt. See rules #2 and #3 - exhaust prompt engineering and find the limits of normal GPT-3 before jumping off into the deep end.

Rule 5: Finetuning is way more work than you think it is

With the latest iterations of INSTRUCT models, prompt engineering basically comes down to the PB&J test: can you write clear, concise instructions? If so, then a good prompt will take you a couple minutes to write and fiddle with. You can rapidly iterate by tweaking some verbs and adjectives here and there, and by adding few shot examples to get the consistency that you want.
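
To make the "add few shot examples" step concrete, here is a minimal sketch in Python of assembling an instruction plus a few worked examples into a single prompt. The function name `build_few_shot_prompt` and the `Input:`/`Output:` labels are hypothetical choices for illustration, not part of any API, and the example texts are made up.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, worked examples, and a new input into one prompt."""
    parts = [instruction.strip(), ""]
    for source, target in examples:
        # Each worked example shows the model the exact format we expect back.
        parts.append(f"Input: {source}")
        parts.append(f"Output: {target}")
        parts.append("")
    # End on an open "Output:" so the model completes the last example.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Summarize each passage in one sentence.",
    [
        (
            "The meeting ran long because the agenda was unclear.",
            "An unclear agenda made the meeting run long.",
        ),
    ],
    "The release slipped a week after two tests failed late in the cycle.",
)
print(prompt)
```

Iterating then mostly means editing the instruction string (for example swapping the verb) or adding and removing example pairs, and pasting the result into the Playground.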

Finetuning? Okay, where do we start…

Namely, you need 200 samples minimum. You can probably bootstrap with less, but you might be wasting your time. If you haven’t done the work to figure out the limitations of normal prompts, then you might not even understand what you’re looking for. I’ve seen this several times in the past: A team has a vague notion of what they want to achieve but they cannot articulate it in concrete terms, or even come up with examples. I’ve discovered that this kind of magical thinking leads to a lot of frustration and conclusions like “GPT-3 just isn’t that good”. In reality, if YOU don’t understand your task, there’s not a snowball’s chance in heck that GPT-3 is going to understand your task either! GPT-3 may be smarter than you, but it can’t read your mind, so if you don’t have a clear expectation of what you want out of the machine, you’re going to be disappointed.

“I’ll know it when I see it” is a maverick mentality and, while there is some value in exploration and discovery and just playing with GPT-3, this mentality should be long gone before you decide to finetune. I am speaking from personal experience as much as I’m speaking about observations I’ve made in other teams. The art and science of curating a finetuning dataset will force you to confront just how poorly you understand your own problem.

Here’s an example: I have been working on developing my Core Objective Functions idea (like Asimov’s Three Laws of Robotics, except actually good). I played around with it in the Playground and it works well enough: GPT-3 understands the idea of “reduce suffering, increase prosperity, and increase understanding” well enough. These are ordinary verbs and nouns with deeply embedded meanings. Nothing arcane or obscure here. But when I sat down to figure out how to compose a finetuning dataset for this task, I froze up. I fiddled and futzed with it for literally months and still haven’t gotten great results. Why? The reason is that my objective is a vague notion of “Make a finetuning dataset that can serve as a moral center for AGI”. You know, nice and simple and clear, right?

As a senior data scientist once told me: Getting good data is ALWAYS the hardest part of this job - that problem does not go away with finetuning! It only gets harder. The thing about prompt engineering is that you don’t need any data, you just need your imagination and your skill with words. In this respect, finetuning is an entirely different discipline from ordinary GPT-3 usage.
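
As a rough illustration of the bookkeeping side of this, here is a minimal Python sketch that checks a candidate dataset before you spend time on a finetune. It assumes samples are stored as prompt/completion pairs written out as JSONL; the helper names `check_dataset` and `to_jsonl`, the sample texts, and the specific checks are hypothetical. The 200 figure is the rule of thumb from this post, not an API limit.

```python
import json

MIN_SAMPLES = 200  # rule of thumb from the post, not a hard API limit


def check_dataset(samples, min_samples=MIN_SAMPLES):
    """Return a list of human-readable problems found in the samples."""
    problems = []
    if len(samples) < min_samples:
        problems.append(f"only {len(samples)} samples, want at least {min_samples}")
    seen = set()
    for i, sample in enumerate(samples):
        prompt = sample.get("prompt", "").strip()
        completion = sample.get("completion", "").strip()
        if not prompt or not completion:
            problems.append(f"sample {i}: empty prompt or completion")
        if prompt in seen:
            problems.append(f"sample {i}: duplicate prompt")
        seen.add(prompt)
    return problems


def to_jsonl(samples):
    """Serialize samples as one JSON object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)


samples = [
    {"prompt": "Customer: My order is late.\nAgent:", "completion": " I'm sorry about the delay."},
    {"prompt": "Customer: My order is late.\nAgent:", "completion": " Apologies, I will look into it."},
    {"prompt": "Customer: Cancel my account.\nAgent:", "completion": ""},
]
problems = check_dataset(samples)
for problem in problems:
    print(problem)
```

A check like this only catches mechanical problems. It cannot tell you whether the examples actually capture the task, which is the hard part described above.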

In summary…

I could keep going, but I think these five rules cover most of it. I hope this helps. It’s just been my experience that many people are not ready for finetuning even if they are convinced they need it, and in many cases, finetuning is not needed at all for their task.

paging @bakztfuture - I know you like talking about this stuff on your podcast, feel free to mention this on your channel!

Great practical advice, @daveshapautomator! GPT-3 is such a versatile and powerful model that I haven’t found a reason to go for a fine-tune other than saving on OpEx for a very specific high-volume purpose.

Exploring GPT-3 in the Playground and prompt engineering should be the first things newcomers do.

Thank you so much for sharing this resource!! :scream::scream::scream:

I have the same feeling as you. I am in a discussion group at Huawei in China, which is developing a Chinese version of GPT-3. Most of the members of the group have never used GPT-3, and they have a very limited understanding of what a humanlike NLP model can do. They test the Chinese NLP model with statements such as “I am on Earth and I jumped 5 meters in the long jump” and “I’m on the moon, standing long jump out 5 meters”, and ask the model to answer true or not true. The Chinese version of GPT-3 cannot answer correctly, but when I use GPT-3 it can. Here are my prompts and results:

the gravitational force on the Moon’s surface is about one-sixth of the Earth’s. If I can jump 1 meter in the long jump on Earth, I would jump 5 meters on the moon. Based on the evidence, If the assertion is true, the answer is “correct.” If the assertion is not true, the answer is “incorrect.”

assertion: I am on Earth and I jumped 5 meters in the long jump.

answer: incorrect

assertion: I’m on the moon, standing long jump out 5 meters
answer: correct
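
For anyone who wants to repeat this kind of test in bulk, here is a minimal Python sketch of the same prompt layout plus a parser that maps the model's reply to a Boolean. The function names `build_assertion_prompt` and `parse_answer` are hypothetical, and the actual call to the model is left out on purpose.

```python
def build_assertion_prompt(evidence, assertion):
    """Lay out evidence, labeling rules, and one assertion, ending on an open answer."""
    return (
        f"{evidence} Based on the evidence, if the assertion is true, "
        'the answer is "correct." If the assertion is not true, '
        'the answer is "incorrect."\n\n'
        f"assertion: {assertion}\n\n"
        "answer:"
    )


def parse_answer(completion):
    """Map a raw completion to True, False, or None when it is neither label."""
    text = completion.strip().lower()
    # Check "incorrect" first so the two labels are handled in a fixed order.
    if text.startswith("incorrect"):
        return False
    if text.startswith("correct"):
        return True
    return None


evidence = (
    "The gravitational force on the Moon's surface is about one-sixth of the Earth's. "
    "If I can jump 1 meter in the long jump on Earth, I would jump 5 meters on the moon."
)
prompt = build_assertion_prompt(evidence, "I'm on the moon, standing long jump out 5 meters")
print(prompt)
print(parse_answer(" correct"))
```

Returning `None` for anything other than the two labels keeps off-format replies visible instead of silently counting them as wrong.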

See, even that is extremely limited, because the output requested is a Boolean. You don’t need an LLM to do that. But your example perfectly illustrates my point: developers and computer scientists tend to think in explicit, quantitative terms (Boolean, statistics, etc). They are not trained to qualify the value of the following kind of output, even though it’s remarkable. It requires other disciplines to fully appreciate how incredible this is.

THIS IS EXCELLENT. This is actually why I haven’t dabbled in fine-tuning yet. Even though I’m familiar with prompt engineering, I would like to find the best use-case that would require fine-tuning a model, which takes time and consideration for a lone developer like me. Thank you so much for this tutorial @daveshapautomator!

Re #4 above - you can make your finetunes less repetitive and still creative, while staying within the tuned parameters, by pushing the temperature up to 0.90-1.0, though it becomes very, very wild at that point. I also tend to add f=1.73, p=0.43 to try to remove repetition in that wildness.
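
Here is a minimal Python sketch of those settings collected in one place. It reads f and p as the frequency and presence penalties, which is an assumption, as are the value ranges checked below (temperature 0 to 2, penalties -2 to 2). The helper name `sampling_params` is hypothetical, and the dict is only the settings, not a full request.

```python
def sampling_params(temperature=0.9, frequency_penalty=1.73, presence_penalty=0.43):
    """Collect sampling settings in one dict, rejecting out-of-range values."""
    # Assumed ranges: temperature 0 to 2, penalties -2 to 2.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature out of range: {temperature}")
    penalties = {
        "frequency_penalty": frequency_penalty,
        "presence_penalty": presence_penalty,
    }
    for name, value in penalties.items():
        if not -2.0 <= value <= 2.0:
            raise ValueError(f"{name} out of range: {value}")
    return {"temperature": temperature, **penalties}


params = sampling_params()
print(params)
```

Keeping the values in a single helper makes it easy to sweep the temperature while holding the penalties fixed, which is one way to find where "creative" turns into "wild" for a given finetune.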

Thank you so much for this observation. I suspected something like this would be true. Prompt engineering may be the 80% solution for most use cases.

I’ve made a YouTube video based on this post: