Fine-Tuning In a Nutshell with a Single Line JSONL File and n_epochs

Hi @curt.kennedy

I ran the same fine-tuning tests with n_epochs 16 as follows:

In all cases, the results were “as expected” (dead on):

Here is the output for babbage, as an example. ada and curie were the same.




Hope this helps :slight_smile:

Note: I will run 8 for you next!

Send Bitcoin LOL :slight_smile: :+1:


Submitted the 8 n_epochs jobs yesterday and woke up today to find 2 of them had failed, which is the first time this has happened, so we only have the results for ada at the moment.

As expected, the completion test using only 8 epochs showed very poor results:

ada (8 n_epochs)

… and I have resubmitted babbage and curie to see if the cake will bake the second time around.



curie 8 has baked.

The results are better than ada but not by much :nerd_face:, mostly chatbot babble.

babbage 8 has baked

The results are as poor as ada, just chatbot babble nonsense for babbage 8.

It’s truly remarkable, the difference between 8 and 16 n_epochs is very clear.
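
For anyone who wants to repeat the comparison, here is a minimal sketch of the job grid described above: three base models at 8 and 16 epochs. It only builds the request parameters and does not call any API. The field names `training_file`, `model`, and `n_epochs` are assumed to follow the legacy fine-tunes endpoint, the file ID is a placeholder, and this is not the code that produced the results in this topic.

```python
from itertools import product

BASE_MODELS = ["ada", "babbage", "curie"]  # models compared in this topic
EPOCH_COUNTS = [8, 16]                     # the two settings compared above


def build_jobs(training_file_id):
    """Return one parameter dict per (model, n_epochs) combination."""
    return [
        {"training_file": training_file_id, "model": model, "n_epochs": n}
        for model, n in product(BASE_MODELS, EPOCH_COUNTS)
    ]


# "file-PLACEHOLDER" is a hypothetical ID, not a real uploaded file.
jobs = build_jobs("file-PLACEHOLDER")
for job in jobs:
    print(job)
```

Keeping the grid in one place makes it easy to resubmit only the jobs that failed.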

Hope this helps @curt.kennedy and others.



@ruby_coder These results make me think that the decoder layer has a fixed size across all the models, since you get the same results for the same epoch count. That is surprising, because their fine-tuned pricing is exponential across the models. You would think it would be base model pricing plus a constant fee per 1k tokens. Maybe it’s just a coincidence, I don’t know.


Hi @curt.kennedy

Let me know if you want me to run some different tests.

I have plenty of OpenAI credits to run these experiments, compliments of OpenAI; and as you have seen in the results and screenshots, I have an easy to configure and use lab setup.

Suggestions on projects and experiments to improve the signal-to-noise ratio here in the developer community would be greatly appreciated.



I did this test and the results were as bad as always.


Maybe start a new topic with your fine-tuning details, @georgei ?

The results will only be as good as your JSONL data (which we have not seen), the formatting (which we have not seen) and how you fine-tune (which we have not seen).

As this is a technical forum for developers, you should post your code (which we have not seen), your JSONL data (which we have not seen), your fine-tuning parameters (which we have not seen).

If you want help (assuming you do want help fine-tuning), then kindly start a new topic with all your details so we can assist you.




Got it!
I’ll try to avoid your topics then, in the future.

To clarify a bit my comment, here are some specs of what I did:

  1. I tried to replicate @ruby_coder’s parameters, including the stop words, down to the blank character before “####” or “++++”. I haven’t tried the temperature he used, but it made no sense to use it since my use case was clearly failing with a more advantageous temperature.
  2. My goal was to have the model use the data I provided in circumstances different from those in the fine-tuning file.

I’ll expand the second point a bit:

  • If you fine-tune and expect GPT-3 to use your data in ways other than those provided in the training file, it is not going to work. My goal was to test whether using a higher number of epochs would change this outcome.
  • A more concise example of what will not happen if you fine-tune with more epochs: prompt: “what is your favorite color?”, completion: “I’m in love with red”. After fine-tuning, if you ask GPT-3 the exact words from the prompt, it will answer correctly. If you ask GPT-3 to name an object with your favorite color, it will give a random response.

That’s not what my reply said or implied. I simply stated that you should create your own topic if you have an issue and you need assistance, not hijack a topic with vague statements like “I did this test and the results were as bad as always.”, which does not make technical sense at all since we have no idea what you did since you refuse to post code, your JSONL file, parameter details or anything really that others can review.

Of course, if you completely change the prompt as you have done, it will not match. That’s just fine-tuning basics and has little to do with this tutorial on how to use n_epochs. You simply changed the prompt to something dissimilar to the original fine-tuning prompt (which would not statistically match in the decoder) and then you say “it does not work”, which has very little to do with the original topic. That is why I suggested you create a new topic if you need help fine-tuning with a specific prompt.

The prompt does not have to be exact, but it should be close. I tested other word choices (not exact matches) and the results were fine. Your example completion prompt, @georgei, is statistically different, so it will require a different fine-tuning prompt. It’s simple statistics.

Yes, because that was not how the fine-tuning was done. You are expecting a single-line fine-tuning to match a dissimilar text. GPTs generate text based on probabilities, not on inference. GPTs are not inference engines; they are “fancy” auto-completion text generators.

I think you, @georgei, have changed both the use case and the topic. If you want a single-line fine-tuning to match your example prompt, it’s easy to create one, and it really has very little to do with my n_epochs examples here. We can easily create a fine-tuning which will match both my original prompt and your challenge prompt, but the fine-tuning will take two JSONL lines (two prompt-completion key-value pairs) and not one.
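
A minimal sketch of what such a two-line file could look like. Both prompts and both completions are made up for illustration and are not the training data used in this topic. The ` ++++` prompt separator, the ` ####` stop sequence, and the leading space on the completion are also assumptions, since the actual file is not shown here.

```python
import json

PROMPT_END = " ++++"      # assumed prompt separator
COMPLETION_END = " ####"  # assumed stop sequence

# Illustrative pairs only; not the actual training data from this topic.
pairs = [
    ("What is your favorite color?", "My favorite color is red."),
    ("Name an object with your favorite color.", "A red apple."),
]


def to_jsonl(pairs):
    """Serialize prompt/completion pairs, one JSON object per line."""
    lines = []
    for prompt, completion in pairs:
        record = {
            "prompt": prompt + PROMPT_END,
            # Leading space before the completion text is an assumed convention.
            "completion": " " + completion + COMPLETION_END,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


jsonl_text = to_jsonl(pairs)
print(jsonl_text, end="")
```

Each line is an independent JSON object, so adding a third prompt later only means appending another pair to the list.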

Honestly, @georgei I am sure you understand this already :slight_smile: but if not I will create a two line JSONL tuning file that will match both of our prompt examples and show you the detailed results.



I actually answered my own question which was on topic.
You said you don’t have time to do it.


@ruby_coder Can you next test to see if your fine-tune is overfitted?

For example, see if it can generalize:
Put in slight semantic variations of the prompt and see what the completion is?

This could be a word difference or a casing difference. For example, here are different word and casing differences that result in different input tokens:

If it doesn’t generalize well, it’s overfitted (Looking at N=16 and N=32 here)
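
As a rough sketch of that check: feed a handful of prompt variations to the model and count how many still produce the expected answer. The `lookup_table_model` function below is a hypothetical stand-in for a severely overfitted fine-tune (exact match only), and the prompts and the expected word "red" are illustrative. In a real test you would swap in a call to your own fine-tuned model.

```python
TRAINED_PROMPT = "What is your favorite color?"


def match_rate(variations, expected, complete):
    """Fraction of prompt variations whose completion contains the expected text."""
    hits = sum(
        1 for prompt in variations
        if expected.lower() in complete(prompt).lower()
    )
    return hits / len(variations)


def lookup_table_model(prompt):
    """Hypothetical stand-in for an overfitted model: answers exact matches only."""
    if prompt == TRAINED_PROMPT:
        return "My favorite color is red."
    return "random babble"


variations = [
    "What is your favorite color?",    # the trained prompt itself
    "what is your favorite color?",    # casing difference
    "What is your favourite colour?",  # spelling difference
    "What is you favorit color?",      # typos
]

rate = match_rate(variations, "red", lookup_table_model)
print(f"{rate:.2f}")  # prints 0.25: only the exact prompt matches
```

A rate near 1.0 across variations suggests the fine-tune generalizes; a rate that collapses to the exact prompt alone suggests a lookup table.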

Also, can you re-run these for N=4 and N=8, but with a temperature of 0? Lowering the temperature may be all you need to get these to work without overfitting.
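
To illustrate why temperature might matter here, below is a small sketch of how temperature reshapes next-token probabilities. The logits are made-up numbers, and treating temperature 0 as a plain argmax is an assumption about how sampling behaves, not a statement about the API’s internals.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures push more probability onto the top-scoring token, so a weakly trained (N=4 or N=8) model may still emit the fine-tuned completion if that completion is already the most likely one.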


Hi @curt.kennedy

I expected it to be “overfitted”.

In my use case (as mentioned earlier in the topic), “overfitted” was a positive, not a negative and it was not really the goal of this topic.

I have already done this (but did not publish the results).

Slight variations had the same completion result as no variation (the original).


Sure, but severely overfitted just means you created a lookup table!

If the model can still respond reasonably well for a fine-tune with higher epochs, I think this is what the community’s use case is, right? But if it falls apart with slight wording differences, then this would be bad. The reason is, most inputs are not exact and uniform all the time, so generalization would be good here.


OK, that is what I was wondering. If so, this is good.


I don’t think this is necessarily the community use case, but I get your point.

The “community use case” I was responding to was people saying they cannot get a match at all (as stated in my first post), that fine-tuning does not work, and that they get nothing but garbage.

You are creating a new “test for overfitting” use case to test which I think should be in a different topic.

This was also the reason I did not respond to @georgei’s first query: he was posting a prompt which I knew would not generate his expected results, because the fine-tuned model was not fine-tuned for that prompt, so I thought it was off-topic for this discussion.

Perhaps I tend to be very focused? The title of this topic is:

Fine-Tuning In a Nutshell with a Single Line JSONL File and n_epochs

and not:

Fine-Tuning In a Nutshell with a Single Line JSONL File and n_epochs and testing for overfitting

In my mind, they are different topics and, to be honest, I have not seen any community member complain about “overfitting” in the 5 weeks I’ve been active here. The subject seems to have come up in this topic as a “theoretical”.

Also, I mentioned earlier that I thought “overfitting” could be positive or negative, based on the use case.



I think most people are under-fitting, which is a problem (default epoch is 4), and why your post shows how to “fix” this problem.

But the next concern, after everyone does N=16 or N=32 epoch fine-tunes is overfitting or possible overfitting. And this shows up as similar inputs not going to their expected outputs.

So the pendulum could sweep the other way!


Yes, exactly. Totally agree.

I can see your point as well as @georgei that they consider overfitting on topic in this tutorial, for sure.

But it is going to become a very long topic if we add a lot of n_epochs overfitting tests, in my mind.



I shared previously my tests on this.
An exact match is possible with many prompts/completions and a small number of epochs.

You proved that it is possible to obtain an exact match if the number of epochs is high.

That’s great for expenses when doing fine-tuning.

You said you don’t have time.
It was just a prompt test - a minute maybe.
So I spent an hour to replicate your parameters and test myself. :upside_down_face:



You proved that it is possible to obtain an exact match using a single line JSONL file if the number of epochs is high.


Where did I say “I did not have time” related to your question, @georgei ?

I never replied to you at all (for the reason I just mentioned above, because I thought it was off topic so I just did not reply).

Are you replying using ChatGPT? You seem to be making things up as you go, hahaha :slight_smile:

(just joking, don’t get upset… )



Here ya go, @curt.kennedy, a bunch of “slight variations”… as requested (16 n_epochs). Is this overfitting or not? Does not seem overfitted to me, especially in light of variation 5.

Variation 1

Variation 2

Variation 3

Variation 4

As you can clearly see, “exact” matches are not required.

Here is a very interesting variation:

Variation 5: What is you favorit color?

I’m not a super expert on overfitting, but it does not seem “overfitted” to me.