Fine-Tuning in a Nutshell with a Single-Line JSONL File and n_epochs

Anyone who frequents this community will notice that developers struggle to get good fine-tuning results. In this multi-part tutorial, I’m going to use the davinci base model and fine-tune it multiple times with the same single-line prompt, using different values for n_epochs.

I will then run a completion with each of these models and demonstrate that, if you set your n_epochs value high enough, you can get excellent results fine-tuning with a single-line prompt!

Before we get started, I would like to thank OpenAI for granting me a few extra credits so I can keep helping developers here by running embeddings, fine-tunings, completions, and other API calls to test developer problems and post the results. Running all these tests costs money (not that much, but it adds up), and thanks to OpenAI I am free to be more creative in helping other developers from my lab setup.

So, let’s get started!

First of all, let me introduce you to the simple single-line JSONL fine-tuning line we will be working with in this tutorial:

{"prompt":"What is your favorite color? ++++", "completion":" My super favorite color is blue. ####"}

Notice that the JSONL line above meets all the OpenAI criteria for a properly formatted JSONL key:value line item, namely:

  • The prompt ends with a separator; in this tutorial I will use ++++.
  • The completion begins with a single white space.
  • The completion ends with a stop sequence; in this tutorial I will use ####.
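
For anyone who prefers to generate this line programmatically, here is a minimal Python sketch that applies the three rules above (the helper function and file name are mine, not part of any OpenAI tooling):

```python
import json

SEPARATOR = " ++++"   # appended to every prompt
STOP = " ####"        # appended to every completion

def make_training_line(prompt, completion):
    """Build one JSONL training line: separator on the prompt,
    single leading space and stop sequence on the completion."""
    return json.dumps({
        "prompt": f"{prompt}{SEPARATOR}",
        "completion": f" {completion}{STOP}",
    })

with open("color.jsonl", "w") as f:
    f.write(make_training_line(
        "What is your favorite color?",
        "My super favorite color is blue.",
    ) + "\n")
```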

Note that I have coded a validator, but since this single-line JSONL file is so small, I’m not going to do anything but show you how it looks in my “lab” setup:

Validation Function

Validation Results:

Note:

I strongly encourage all developers who are fine-tuning to validate their JSONL data for both JSONL compliance and compliance with the OpenAI API “Preparing your dataset” guidelines. You can do this with a regex or another method that fits your coding style and experience. I use a regex.
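
My own validator isn’t shown above, but a bare-bones sketch of the regex approach (my pattern, tailored to the ++++ / #### conventions used in this tutorial, not an official OpenAI check) could look like this:

```python
import json
import re

# Conventions for this tutorial: prompt ends with " ++++",
# completion starts with a single space and ends with " ####".
PROMPT_RE = re.compile(r".+ \+\+\+\+$")
COMPLETION_RE = re.compile(r"^ \S.* ####$")

def validate_jsonl(path):
    problems = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {i}: invalid JSON ({e})")
                continue
            if not PROMPT_RE.match(record.get("prompt", "")):
                problems.append(f"line {i}: prompt missing ' ++++' separator")
            if not COMPLETION_RE.match(record.get("completion", "")):
                problems.append(f"line {i}: completion missing single leading space or ' ####' stop")
    return problems or ["OK: all lines passed"]

print("\n".join(validate_jsonl("color.jsonl")))
```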

New Fine-Tuning Params

The following screenshot shows my current “new fine-tuning” method. You can see that I have many preset n_epochs values to test and share with you, including 4, 8, 16, and 32. You will see the completion results (the good and the bad) for each of these n_epochs values:

Set Up Summary

I fine-tuned the base davinci model with many different n_epochs values. For those who want the bottom line without reading the entire tutorial and examples, it is this: if you set your n_epochs value high enough (and your JSONL data is properly formatted), you can get great results fine-tuning even with a single-line JSONL file!
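
For reference, here is a minimal sketch of how these jobs could be kicked off with the 2023-era (pre-1.0) openai Python library; my actual tooling differs, so treat this as an approximation rather than my exact code:

```python
import openai  # openai-python v0.27.x style, which this tutorial was written against

# Assumes OPENAI_API_KEY is set in the environment.

# Upload the single-line JSONL training file once.
upload = openai.File.create(file=open("color.jsonl", "rb"), purpose="fine-tune")

# Kick off one fine-tune per n_epochs value being tested.
# (The uploaded file may need a moment to finish processing first.)
for n in (4, 8, 16, 32):
    job = openai.FineTune.create(
        training_file=upload["id"],
        model="davinci",
        n_epochs=n,
    )
    print(n, job["id"], job["status"])
```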

In the next screen grab, I show how I list my fine-tuned models:

List Fine-Tuned Models Function
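
The function in the screenshot is my own; a bare-bones equivalent using the same 2023-era library might look like this:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# List recent fine-tune jobs and show the resulting model name,
# job status, and the n_epochs value each job was run with.
for ft in openai.FineTune.list()["data"]:
    print(ft["fine_tuned_model"], ft["status"], ft["hyperparams"].get("n_epochs"))
```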

I plan to use the following fine-tuned models for completions to demonstrate how to use n_epochs to get great results:

Fine-Tuned Models and the n_epochs Value

  • davinci:ft-personal-2023-02-14-06-55-17 (4 n_epochs, the default)
  • davinci:ft-personal-2023-02-14-06-28-14 (8 n_epochs)
  • davinci:ft-personal-2023-02-14-09-01-20 (16 n_epochs)
  • davinci:ft-personal-2023-02-14-07-05-48 (32 n_epochs)

So, now for the results… !

I will reply to this post with the results so you can see how accurate (or inaccurate) each completion is based on the n_epochs value.

Stay tuned.

:slight_smile:

28 Likes

Very well done tutorial @ruby_coder

Although it’s definitely a good idea to have more prompt-completion pairs to cover edge cases, depending on the scenario.

3 Likes

Yes, but that is not the purpose of this tutorial.

I’m going to demonstrate the basics first with a single-line JSONL file, showing only how n_epochs values affect the results, in a very controlled setup.

Let’s consider working together on a follow-up that goes down that path. Can we discuss it after this one concludes, @sps?

:slight_smile:

4 Likes

THE UGLY

First, let’s look at the “ugly”, which is using an n_epochs value of 4, the current default.
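
The completion setup shown in the screenshots below is roughly equivalent to this sketch (the temperature and max_tokens are my guesses at a typical configuration, not necessarily the exact values pictured):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

response = openai.Completion.create(
    model="davinci:ft-personal-2023-02-14-06-55-17",  # the 4 n_epochs model
    prompt="What is your favorite color? ++++",        # same separator as training
    temperature=0,
    max_tokens=64,
    stop=["####"],                                     # same stop as training
)
print(response["choices"][0]["text"])
```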

Completion Setup

Completion

Comments

I cannot help but LOL and ROTFL over this. This is what many people experience when they attempt to fine-tune: they post here in “hair pulling out” anguish because their fine-tunings are so bad.

Let me assure all friendly developers: increasing n_epochs will change this, as we reach a kind of “tipping point” value for this single-line JSONL file later on in this demonstration / tutorial.

4 Likes

I just decided to run the fine-tuning again for n_epochs 8, 16, and 32 because I noticed I had used different JSONL data when experimenting earlier. So, for consistency, I re-ran the fine-tunings and they are pending:

Pending Fine-Tunings for 8, 16 and 32 n_epochs

It should take around an hour or so for all these fine-tunings to fully process and load, so I’m going to have dinner and will come back with the results after the models are cooked.

Don’t hold your breath for lightning-fast fine-tunings when you increase n_epochs! (Hahaha)

:slight_smile:

Appendix: Example Screen Grab Fine-Tuning with 32 n_epochs:

4 Likes

Update

Thirty minutes have passed and things are still processing smoothly at lightning-fast speeds (joking).

I’m OK with the slowness because this is a “research beta” and I’m doing “research”, so my personal “completion temperature” is a cool 60 degrees F :slight_smile:

My initial guess was it would take an hour to bake these cakes, so let’s see where we are in 30 minutes, shall we?

:frowning:

4 Likes

Looks like, as expected, the 8 n_epochs model is ready to Rock n’ Ruby :slight_smile:

Let’s try it…

… and as expected … “The Bad”

Using 8 n_epochs gets us closer to what we are shooting for, but this was expected (since I have done this before today) and I knew that 8 was not going to cut the mustard.

Any Gamblers?

Anyone care to wager on 16 n_epochs? I’m confident 16 will get us very close, if not “right on the mark”.

We still need to wait for the 16 and 32 cakes to bake, so there is time to place your bets!

:slight_smile:

6 Likes

YAY! “The Good”

Over two hours later, the 16 n_epochs cake has finally baked.

… and as expected, it’s a winner!

Completion Setup with 16 n_epochs fine-tuned model:

… and the winner is… a perfect completion.

Stand by for the 32 n_epochs cake to bake, even though we all know the result will be great.

My guess is it will take at least another hour for the 32 n_epochs shoe to drop, maybe two.

Time will tell…

5 Likes

“The Overkill”

Finally, the 32 n_epochs cake has baked and as expected, the results are solid.

32 n_epochs Results:

No surprises here.

All my tests have shown that 16 n_epochs works well and gives the desired results.

So, what have we learned?

Well, we can say “with some authority” that fine-tunings work, even with single-line JSONL files, as long as they are properly formatted and the n_epochs value is high enough.

So, when we read others posting that it is not possible to get good results from fine-tuning, or from fine-tuning with only a few lines of training data, that is not accurate. It is possible, but, as demonstrated, the n_epochs value must be high.

This tutorial / demonstration did not address embeddings, but of course if I search the DB using embedding vectors, I will get great results as well :slight_smile:

We can change the prompt as well, for example:

Do you have a favorite color? or What color do you like?

… and we get good results for the fine-tuned model

as well as the vector search:
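
For anyone curious, a rough sketch of that vector search, substituting text-embedding-ada-002 and plain cosine similarity for my actual DB code, could look like this:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment
import numpy as np

def embed(text):
    """Return the embedding vector for a piece of text (2023-era API style)."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return np.array(resp["data"][0]["embedding"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The stored "DB" is just the single training completion in this toy example.
stored = embed("My super favorite color is blue.")

for query in ("Do you have a favorite color?", "What color do you like?"):
    print(query, "->", round(cosine(embed(query), stored), 3))
```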

Closing Comments

We could re-fine-tune the 8 n_epochs model for an additional 8 n_epochs and see how that model compares to the 16 n_epochs model. We could run all kinds of tests, but I’m going to leave it alone and not add more test cases unless we see at least 100 likes in this topic :slight_smile:

Finally, if anyone had any doubt about fine-tuning or how to apply n_epochs to get better completion results, then this tutorial / demonstration should have erased all doubts. You can fine-tune a model with a single key-value pair and get good results if you correctly format your training data and crank the n_epochs value up high enough; there is no doubt.

HTH

:slight_smile:

13 Likes

I would love to see the findings on this experiment. Can’t wait to have my mind blown.

3 Likes

Nice tutorial.
Can you test the 32 n_epochs model with a different prompt?
Something like this:
“Tell me what is your favorite color by naming an object with that color”.

1 Like

Thanks for the 1s-and-2s of how to go through this process! (^_^)

  • I appreciate all the insight and wisdom being shared; it’s greatly helpful for everyone who’s looking to learn.

I’m curious though: why would someone go through this when you might be able to accomplish very similar results through prompt engineering?

Curie

Babbage

Ada


  • If you’ve been following the topic of “fine-tuning” in the wild, you’ll have started to see posts, articles and papers from people who have been “in the know” for a few years - and a lot of them are saying something similar to:
    • Prompt engineering can do so much more than originally thought, and fine-tuning is more hassle than it’s worth
    • Things like few-shot learning combined with prompt engineering can take the place of fine-tuning.

I’m curious what you feel are the pros and cons of the fine-tuning process compared to less resource-intensive efforts?
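
For instance, the kind of few-shot prompt I have in mind on the base davinci model (with made-up examples, just to illustrate the alternative) might look like:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# Two in-context examples, then the real question.
few_shot_prompt = """Q: What is your favorite color?
A: My super favorite color is blue.

Q: What is your favorite number?
A: My super favorite number is seven.

Q: What is your favorite color?
A:"""

response = openai.Completion.create(
    model="davinci",        # base model, no fine-tuning
    prompt=few_shot_prompt,
    temperature=0,
    max_tokens=32,
    stop=["\n"],
)
print(response["choices"][0]["text"].strip())
```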

4 Likes

I really liked seeing this experiment play out. One thing I have wondered is how the epochs behave across all four models (ada, babbage, curie, davinci). Not that I need you to recreate this across the other three models, but my theory is that they require more epochs as you go lower in model parameters. So ada would need more than babbage, which needs more than curie, etc.

Good work!

3 Likes

In my experience with GPT-2, the higher you go with epochs, the more you have to start worrying about over-fitting, i.e. output from the model being verbatim from the training data… just something to think about and check after the fine-tuning…
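
One rough way to run that check (a sketch only, reusing the single-line training file and one of the models above; not an official method) is to compare a completion against the training completions:

```python
import json
import openai  # assumes OPENAI_API_KEY is set in the environment

# Load the training completions so we can check for verbatim regurgitation.
with open("color.jsonl") as f:
    trained = [json.loads(line)["completion"].strip(" #") for line in f]

# Probe the fine-tuned model with a paraphrased prompt at a non-zero temperature.
response = openai.Completion.create(
    model="davinci:ft-personal-2023-02-14-09-01-20",  # the 16 n_epochs model
    prompt="What color do you like? ++++",
    temperature=0.7,
    max_tokens=64,
    stop=["####"],
)
output = response["choices"][0]["text"].strip()

# A byte-for-byte match with the training data suggests memorization
# rather than generalization.
if output in trained:
    print("verbatim match with training data:", output)
else:
    print("novel output:", output)
```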

Great tutorial, though. Well laid out…

6 Likes

Just to play devil’s advocate, I wonder if it is the opposite.

With the other models not being so clever, maybe it will be easier to overwhelm the existing LLM with fewer epoch passes of the data.

3 Likes

Good point on overfitting @PaulBellow

I have created my own classifiers that were overfitted, which led to poor classifications down the line. But what does an overfitted GPT do? Does it just repeat the same thing over and over? If so, that could be funny in certain situations.

Also, @raymonddavey, good counter-observation. The smaller number of coefficients could easily get “burned in” with the higher number of epochs.

6 Likes

My understanding is that it outputs strings of text verbatim from the training data rather than coming up with something new.

4 Likes

For many applications this is a positive and not a negative, so it really depends on the domain.

5 Likes

Oh, for sure. I just brought it up as something to think about depending on your use case for the fine-tuning. Trying to be helpful and earn my stay around here! :wink:

I should try to write up my own tutorial on something soon…

5 Likes

Having a lot of code to write and other projects and tasks, I plan to run more tests on this tutorial / experiment / demonstration topic, based on all the good feedback, when the “total topic like count” reaches 100; it’s currently at 46.

:slight_smile:

2 Likes